In recent years, moments and their uses have been investigated by mathematicians,
statisticians, and engineers. In 1987, the American Mathematical Society sponsored a short
course  on  “Moments  in  Mathematics”  at  its  meeting  in  San  Antonio,  Texas.  This  led  to 
a  volume  containing  the  six  papers  delivered  there.  The  volume  was  published  by  the 
Society  in  its  Short  Course  Series  as  Volume  37  in  its  Proceedings  of  Symposia  in  Applied 
Mathematics. 

Recently,  Dr.  James  Maar  of  the  National  Security  Agency  noted  a  number  of 
problems  in  signal  processing  in  which  moments  of  distributions  were  important,  and  yet 
statisticians and signal processing scientists were unaware of what had been accomplished
by  each  other.  He  initiated  discussions  with  Professor  Peter  Purdue  of  the  Operations 
Research  Department  of  the  Naval  Postgraduate  School  and  Professor  Herbert  Solomon  of 
the  Statistics  Department  at  Stanford  University  about  developing  a  conference  in  which 
moments  and  signal  processing  and  their  interaction  would  be  featured.  Professor  Purdue 
and  Professor  Solomon  agreed  to  explore  this  idea  and  they  developed  and  co-chaired  a 
Conference  on  Moments  and  Signal  Processing  which  was  held  at  the  Naval  Postgraduate 
School  on  March  30-31,  1992.  The  Proceedings  herein  resulted  from  that  conference. 

The  Conference  developed  around  eight  speakers  whose  interests  include  moments 
and  statistics,  signal  processing,  and  interactions  between  the  two.  Professors  Jerry  Mendel 
and  Max  Nikias  came  from  the  signal  processing  community;  Professors  Satish  Iyengar  and 
Michael  Stephens  came  from  the  statistical  community.  The  remaining  four,  Professors 
David Brillinger, Keh-Shin Lii, Bruce Lindsay, and Ed Wegman, came at the subject in
different  shadings  emanating  from  the  central  core  of  the  Conference. 

The Conference was supported substantively by the National Security Agency
and partially by the Office of Naval Research. Many thanks are due to these agencies. A
number of government scientists from the Department of Defense and a limited number of
general community attendees participated in the Conference. This led to a lively audience
of 40 to 50 participants over the two-day period.

It is hoped that the wide availability of the papers in this report will lead to more
communication between the two communities and, of course, within each group.

ABSTRACT


This tutorial paper has two aims: (i) to describe systematic methodologies
for selecting nonlinear transformations for blind equalization algorithms
(and thus new types of cumulants), and (ii) to give an overview of the
existing blind equalization algorithms and point out their strengths and
weaknesses. It is shown that all blind equalization algorithms belong to one
of three categories, depending on where the nonlinear transformation is
applied to the data: (i) the Bussgang algorithms, where the nonlinearity is
in the output of the adaptive equalization filter; (ii) the polyspectra (or
Higher-Order Spectra) algorithms, where the nonlinearity is in the input of
the adaptive equalization filter; and (iii) the algorithms where the
nonlinearity is inside the adaptive filter, i.e., the nonlinear filter or
neural network. We describe methodologies for selecting nonlinear
transformations based on various optimality criteria such as MSE or MAP. We
illustrate that such existing algorithms as Sato, Benveniste-Goursat, Godard
or CMA, Stop-and-Go and Donoho are special cases of the Bussgang family of
techniques when the nonlinearity is memoryless. We present results
demonstrating that the polyspectra-based algorithms exhibit a faster
convergence rate than the Bussgang algorithms. However, this improved
performance comes at the expense of more computations per iteration. We also
show that blind equalizers based on nonlinear filters or neural networks are
better suited to channels that have nonlinear distortions.

The Godard or CMA algorithm is probably the most widely used blind
equalizer in digital communications today due to its simplicity, low
complexity and constant modulus property. Its main drawbacks, however, are
slow convergence and no guarantee of global convergence from an arbitrary
initial guess. We present a new method for blind equalization, the CRIMNO
algorithm (i.e., criterion with memory nonlinearity), which is shown to
retain the advantages of the Godard algorithm (simplicity, low complexity,
constant modulus property) while converging much faster. The CRIMNO
algorithm is flexible enough to address blind deconvolution problems when
the input sequence is colored.

1  INTRODUCTION


Blind deconvolution or equalization is a signal processing procedure that recovers the input
sequence applied to a linear time-invariant, nonminimum-phase system from its output only.
Blind equalization algorithms are essentially adaptive filtering algorithms designed in such a way
that they do not need an externally supplied desired response to generate the error signal in
the output of the adaptive filter. In other words, the adaptive algorithm is “blind” to the desired
response. Instead, the algorithm itself generates the desired response by applying a nonlinear
transformation to sequences involved in the adaptation process. All blind equalization algorithms
belong to one of the following three categories, depending on where the nonlinear transformation
is applied to the data:

•  The Bussgang algorithms, where the nonlinearity is in the output of the adaptive
equalization filter;

•  The Polyspectra (or Higher-Order Spectra) algorithms, where the nonlinearity is in the
input of the adaptive equalization filter;

•  The algorithms where the nonlinearity is inside the adaptive filter; i.e., the filter is
nonlinear (e.g., Volterra) or a neural network.

The purpose of this paper is to provide an overview of the existing blind equalization
algorithms and to discuss their advantages and limitations. Conventional equalization and carrier
recovery techniques used in multilevel digital communication systems usually require an initial
training period, during which a known data sequence (i.e., training sequence) is transmitted [43],
[45]. An effective alternative is to use blind equalizers, which do not require any known training
sequence during the startup period.

The paper describes systematic methodologies for selecting the nonlinearity based on various
optimality criteria, such as maximum likelihood (ML), mean-square error (MSE) or maximum
a posteriori (MAP). As an example, it is illustrated that such existing algorithms as Sato [46],
[47], Benveniste-Goursat [5], [6], Godard or CMA [22], [50], and Stop-and-Go [41] are special
cases of the family of Bussgang techniques where the nonlinearity is memoryless [3], [4]. It
is demonstrated that the polyspectra-based algorithms exhibit a faster convergence rate than the
Bussgang algorithms. However, this improved performance comes at the expense of more
computational complexity. On the other hand, blind equalizers based on nonlinear filters are well
suited to channels that have nonlinear distortions [39], [40].

The Godard algorithm is probably the most widely used blind equalizer in digital
communications today due to its simplicity, low computational complexity, and constant modulus
property. Its main drawbacks, however, are slow convergence and no guarantee of global
convergence (convergence starting from an arbitrary initial guess). The paper describes the
development of the CRIMNO algorithm (i.e., criterion with memory nonlinearity), which is shown
to retain the advantages of the Godard algorithm (simplicity, low complexity, constant modulus
property) while converging much faster [12], [13]. An extension of the CRIMNO algorithm to the
case of colored input signals is also presented.

The polyspectra-based adaptive blind equalization algorithms are also described in the
paper. In particular, the Tricepstrum Equalization Algorithm (TEA) [24], the Power Cepstrum
and Tricoherence Equalization Algorithm (POTEA) [7], and the Cross-Tricepstrum Equalization
Algorithm (CTEA) [8] are presented, as well as their advantages and limitations. It is shown
that these algorithms perform simultaneous identification and equalization of a nonminimum
phase communication channel from its output only. Simulations with PAM and QAM signals
demonstrate the effectiveness of the polyspectra-based algorithms.

Finally,  the  paper  provides  an  overview  of  the  neural  network  based  adaptive  equalization 
algorithms  either  with  or  without  a  training  sequence  [11],  [20],  [26],  [27],  [39],  [40],  [49]. 

2  DEFINITION  OF  BLIND  EQUALIZATION  PROBLEM 

Let us consider the discrete-time linear transmission channel whose impulse response $\{f(i)\}$ is
unknown and possibly time-varying. The input data $\{x(i)\}$ are assumed to be independent and
identically distributed (i.i.d.) random variables, with a non-Gaussian probability density function.
Let us also assume, without loss of generality, that the sequence $\{x(i)\}$ has mean $E\{x(i)\} = 0$
and variance $E\{|x(i)|^2\} = Q_x$. If $x(i)$ is real, we may drop the magnitude function and simply
write $E\{x^2(i)\}$. Initially, noise is not taken into account in the output of the channel. From
Figure 2.1, it follows that the model we consider is

\[
\begin{aligned}
y(i) &= f(i) * x(i) \\
     &= \sum_{k} f(k)\, x(i-k)
\end{aligned}
\tag{2.1}
\]

where $*$ denotes linear convolution and $\{y(i)\}$ is the received sequence. The problem is to
reconstruct (or restore) the input sequence $\{x(i)\}$ from the received sequence $\{y(i)\}$ or,
equivalently, to identify the inverse filter (equalizer) $\{u(i)\}$ for the channel.

From Figure 2.1, we see that the output sequence $\{\tilde{x}(i)\}$ of the equalizer is given by

\[
\begin{aligned}
\tilde{x}(i) &= u(i) * y(i) \\
             &= u(i) * (f(i) * x(i)) \\
             &= u(i) * f(i) * x(i).
\end{aligned}
\tag{2.2}
\]
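The channel model (2.1) and the equalizer output (2.2) can be illustrated with a short numerical sketch. The two-tap channel, the eight-tap truncated inverse and the symbol sequence below are hypothetical values chosen only so that the arithmetic is easy to follow; they do not come from the paper.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for k, bk in enumerate(b):
            out[i + k] += ai * bk
    return out

# Hypothetical channel f(i) and a truncated version of its inverse u(i).
f = [1.0, 0.5]
u = [(-0.5) ** k for k in range(8)]     # first 8 terms of 1 / (1 + 0.5 z^-1)

x = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0]   # transmitted symbols
y = convolve(f, x)                      # received sequence, as in (2.1)
x_rec = convolve(u, y)                  # equalizer output, as in (2.2)
s = convolve(u, f)                      # combined response u(i) * f(i)
```

Because the inverse is truncated, the combined response `s` is a unit spike plus one small trailing term, so the first samples of `x_rec` reproduce `x` and the truncation error appears only later in the output.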

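So, to achieve

\[
\tilde{x}(i) = x(i - D)\, e^{j\theta} \tag{2.3}
\]

where $D$ is a constant delay and $\theta$ is a constant phase shift, it is required that

\[
u(i) * f(i) = \delta(i - D)\, e^{j\theta} \tag{2.4}
\]

where

\[
\delta(i) =
\begin{cases}
1, & i = 0 \\
0, & \text{otherwise.}
\end{cases}
\]

Performing the Fourier transform on (2.4), we obtain

\[
U(\omega) \cdot F(\omega) = e^{j(\theta - \omega D)} \tag{2.5}
\]

In other words, the objective of the equalizer is to achieve a transfer function

\[
U(\omega) = \frac{1}{F(\omega)}\, e^{j(\theta - \omega D)} \tag{2.6}
\]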


In general, $D$ and $\theta$ are unknown. However, the constant delay $D$ does not affect the
reconstruction of the original input sequence $\{x(i)\}$. The constant phase shift $\theta$ can be
removed by a carrier recovery technique. As such, in the sequel, it will be assumed that $D = 0$
and $\theta = 0$.

Blind equalization schemes may be classified into three categories: those which utilize
nonlinearities in the output of the adaptive equalization filter, those which place the nonlinearity
in the input of the adaptive equalization filter, and those which utilize adaptive nonlinear
equalization filters. The Bussgang equalization algorithms with memoryless or memory nonlinearity
belong to the first category, whereas the higher-order cumulant-based equalizers (TEA, POTEA,
etc.) belong to the second category, as they perform a memory nonlinear transformation on the
input data of the equalization filter. Blind equalizers based on nonlinear filters, such as the
Volterra filter or neural networks, belong to the third category. Figures 2.2 (a)-(c) illustrate the
block diagrams of the aforementioned three families of blind equalizers.

3  PERFORMANCE  MEASURES  FOR  ALGORITHM  EVALUATION

Four different performance measures are usually considered in simulation experiments for the
testing of the blind equalization algorithms: the time-average squared error ($E_{\mathrm{MSE}}$), the
transitional symbol error rate (SER), the residual intersymbol interference (ISI) and the discrete
eye patterns [43], [44]. They are defined as follows.

Time-Average Squared Error ($E_{\mathrm{MSE}}$ or MSE)

At iteration (i), the mean square error in the output of the equalizer is defined as:

\[
E_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} |x(i - D) - \tilde{x}(i)|^2 \tag{3.1}
\]

where $\tilde{x}(i)$ is the output of the equalizer at iteration (i) and $x(i - D)$ is the
corresponding true value. Note that the delay $D$, which is introduced by the channel and the
equalizer, does not affect the recovery of the original information $\{x(i)\}$. However, it must be
taken into account in the calculation of MSE (i). The MSE (i) gives a measure of both the noise
and residual ISI at the output of the equalizer.

Transitional  Symbol  Error  Rate  (SER) 

The SER indicates the percentage of wrongly detected symbols in consecutive intervals of 500
symbols, i.e.,

\[
\mathrm{SER} = \frac{\#\ \text{of wrong detections in 500 symbols}}{500} \tag{3.2}
\]
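The two measures defined so far translate directly into code. The sketch below assumes real or complex samples held in plain Python lists; the function names, the `delay` argument (standing for $D$) and the requirement that decisions are already delay-aligned for the SER are illustrative assumptions, not part of the paper.

```python
def time_average_squared_error(x_true, x_eq, delay=0):
    """Average of |x(i - D) - x_eq(i)|^2 over the samples where both exist."""
    pairs = [(x_true[i - delay], x_eq[i])
             for i in range(delay, len(x_eq))
             if i - delay < len(x_true)]
    return sum(abs(a - b) ** 2 for a, b in pairs) / len(pairs)

def symbol_error_rates(x_true, decisions, block=500):
    """Fraction of wrong decisions in each consecutive block of symbols.

    Assumes `decisions` has already been aligned with `x_true`.
    """
    rates = []
    for start in range(0, len(decisions) - block + 1, block):
        wrong = sum(1 for i in range(start, start + block)
                    if decisions[i] != x_true[i])
        rates.append(wrong / block)
    return rates
```

For example, flipping the first five symbols of a 1000-symbol sequence gives a rate of 5/500 in the first interval and zero in the second.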


Residual  ISI 

The residual ISI in the output of the equalizer is defined as follows. Let $\{f(i)\}$ be the channel
impulse response and $\{u(i)\}$ the equalizer tap coefficients at iteration (i). Let
$s(k) = f(k) * u(k)$ denote the combined channel-equalizer response; then

\[
\mathrm{ISI}(i) = \frac{\sum_{k} |s(k)|^2 - \max_{k}\{|s(k)|^2\}}{\max_{k}\{|s(k)|^2\}} \tag{3.3}
\]

Physically, this indicates the amount of ISI present at the output of the equalizer due to imperfect
equalization.
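A minimal sketch of (3.3) follows. The helper `convolve` and the example channel are assumptions made for illustration.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for k, bk in enumerate(b):
            out[i + k] += ai * bk
    return out

def residual_isi(f, u):
    """Residual ISI of the combined response s = f * u, as in (3.3)."""
    power = [abs(v) ** 2 for v in convolve(f, u)]
    peak = max(power)
    return (sum(power) - peak) / peak

f = [1.0, 0.5]                                 # hypothetical channel
isi_no_eq = residual_isi(f, [1.0])             # no equalization: 0.25
isi_two_tap = residual_isi(f, [1.0, -0.5])     # two-tap equalizer: 0.0625
```

The measure is zero only when the combined response is a single spike, and it ignores both the delay and the gain of that spike.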

Discrete  eye  patterns 

Discrete eye patterns (or equalized signal constellation) consist of all possible values of the output
of the equalizer, $\tilde{x}(i)$, at iteration (i), drawn in two-dimensional space. We say that the
eye pattern is open whenever the ideal decoding thresholds are easily distinguishable between
neighboring equalized states.

In our simulations, all performance measures were calculated for many independent signal
and noise realizations. For the $E_{\mathrm{MSE}}$, time averaging over 100 samples was performed for
each realization. The eye pattern at iteration (i) was obtained by drawing the output of the
equalizer for all independent realizations and for a specific number of samples (for each
realization) symmetrically located around (i).

4  ALGORITHMS  WITH  NONLINEARITY  IN  THE  OUTPUT  OF  THE  EQUALIZATION  FILTER

Let us assume that a guess for the impulse response of the inverse filter (equalizer), $u_g(i)$, has
been selected. Then,

\[
u_g(i) * f(i) = \delta(i) + \epsilon(i) \tag{4.1}
\]

where $\epsilon(i)$ accounts for the difference (error) between our guess $u_g(i)$ and the actual
values of $u(i)$. If we convolve the initial guess of the inverse filter, $\{u_g(i)\}$, with the
received sequence, $\{y(i)\}$, we obtain

\[
\begin{aligned}
\tilde{x}(i) &= y(i) * u_g(i) \\
             &= x(i) * f(i) * u_g(i).
\end{aligned}
\tag{4.2}
\]

Combining (4.2) with (4.1), we obtain

\[
\begin{aligned}
\tilde{x}(i) &= x(i) * (\delta(i) + \epsilon(i)) \\
             &= [x(i) * \delta(i)] + [x(i) * \epsilon(i)] \\
             &= x(i) + n(i)
\end{aligned}
\tag{4.3}
\]

where

\[
n(i) = x(i) * \epsilon(i) \tag{4.4}
\]

is the “convolutional noise”, namely, the residual ISI arising from the difference between our
guess $u_g(i)$ and the actual inverse filter $u(i)$.
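The decomposition (4.1)-(4.4) can be checked numerically. In the sketch below the channel, the imperfect guess `u_g` and the symbols are hypothetical values, and `convolve` is a helper defined only for this illustration.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for k, bk in enumerate(b):
            out[i + k] += ai * bk
    return out

f = [1.0, 0.5]                   # hypothetical channel
u_g = [1.0, -0.5]                # imperfect guess of the inverse filter

s = convolve(u_g, f)             # u_g(i) * f(i)
delta = [1.0] + [0.0] * (len(s) - 1)
eps = [a - b for a, b in zip(s, delta)]     # epsilon(i) in (4.1)

x = [1.0, -1.0, 1.0, 1.0, -1.0]  # transmitted symbols
y = convolve(f, x)               # received sequence
x_tilde = convolve(u_g, y)       # deconvolved sequence, as in (4.2)
n = convolve(x, eps)             # convolutional noise, as in (4.4)

# Zero-pad x so that it can be compared with x_tilde sample by sample.
x_pad = x + [0.0] * (len(x_tilde) - len(x))
```

With these values `x_tilde[i]` equals `x_pad[i] + n[i]` for every sample, which is the statement of (4.3).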

Our problem now is to utilize the deconvolved sequence $\{\tilde{x}(i)\}$ to find the “best”
estimate of $\{x(i)\}$, namely $\{d(i)\}$. Note that in the adaptive-filter literature $d(i)$ is
used to represent the desired response [25]. Two criteria are employed to determine the “best”
estimate of $x(i)$ from the given $\tilde{x}(i)$. These are the mean-square error (MSE) and
maximum a posteriori (MAP).

Since the transmitted sequence $x(i)$ has a non-Gaussian probability density function, the MSE
and MAP estimates are nonlinear transformations of $\tilde{x}(i)$. In general, the “best” estimate
$d(i)$ is given by [3], [4], [23], [54]

\[
d(i) = g[\tilde{x}(i)] \quad \text{(memoryless)}
\]

or

\[
d(i) = g[\tilde{x}(i), \tilde{x}(i-1), \ldots, \tilde{x}(i-m)] \quad \text{($m$th-order memory)} \tag{4.5}
\]

where $g[\cdot]$ is a nonlinear function with or without memory. The $d(i)$ is fed back into the
adaptive equalization filter as shown in Figure 4.1. From this figure, it is also apparent that the
nonlinear function $g[\cdot]$ is in the output of the equalization filter.

4.1  Optimum  Selection  of  Nonlinearities 

4.1.1  Nonlinearities  with  MSE  Estimates 

In summary, a well-treated classical estimation problem is as follows:

\[
\tilde{x}(i) = x(i) + n(i) \tag{4.6}
\]

where

(i) $n(i)$ is Gaussian. Note that if $\epsilon(i)$ in (4.4) is long enough, the central limit theorem
makes the Gaussianity assumption for $n(i)$ reasonable.

(ii) $\{x(i)\}$ are independent, identically distributed (i.i.d.) and in general non-Gaussian. The
pdf of $x(i)$ is known; in digital communications the $\{x(i)\}$ are usually equiprobable discrete
signal points.

(iii) $x(i)$ and $n(i)$ are assumed independent.

Given $\tilde{x}(i)$, we seek the MSE estimate of $x(i)$, namely $d_{mse}(i)$.

From Van Trees [52, p. 58], it follows that the best MSE estimate of $\{x(i)\}$ given
$\{\tilde{x}(i)\}$ is the mean of the a posteriori density, i.e.,

\[
\begin{aligned}
d_{mse}(i) &= \int_{-\infty}^{+\infty} x\, P_{x/\tilde{x}}(x/\tilde{x})\, dx \\
           &= E\{x(i)/\tilde{x}(i)\}
\end{aligned}
\tag{4.7}
\]

where $P_{x/\tilde{x}}(x/\tilde{x}) = P_{\tilde{x}/x}(\tilde{x}/x)\, P_x(x) / P_{\tilde{x}}(\tilde{x})$ is the
a posteriori density; $P_{\tilde{x}/x}(\tilde{x}/x)$ is Gaussian, $N(x(i), Q_n)$, with $Q_n$ being the
variance of $\{n(i)\}$; the a priori density $P_x(x)$ is the pdf of $x(i)$, and
$P_{\tilde{x}}(\tilde{x})$ behaves as a normalization constant in the integral of (4.7).

If $x(i)$ is zero-mean Gaussian with variance $Q_x$, i.e., $P_x(x)$ is $N(0, Q_x)$, (4.7) reduces to

\[
d_{mse}(i) = \frac{Q_x}{Q_x + Q_n} \cdot \tilde{x}(i) \tag{4.8}
\]

which, in turn, implies that $g[\tilde{x}(i)]$ is a linear function. However, when $P_x(x)$ is
non-Gaussian, the integral (4.7) cannot be reduced to a simple expression and $g[\cdot]$ will be a
nonlinear function. In the sequel, we show $d_{mse}(i)$ versus $\tilde{x}(i)$ when the pdf $P_x(x)$
is uniform and Laplace.

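Uniform Distribution

The a priori pdf is given by

\[
P_x(x) =
\begin{cases}
\dfrac{1}{2A}, & -A \le x \le A \\
0, & \text{otherwise.}
\end{cases}
\tag{4.9}
\]

Consequently, the a posteriori pdf takes the form

\[
P_{x/\tilde{x}}(x/\tilde{x}) =
\begin{cases}
\dfrac{A_1(x, \tilde{x})}{B_1(\tilde{x})}, & -A \le x \le A \\
0, & \text{otherwise}
\end{cases}
\tag{4.10}
\]

where

\[
A_1(x, \tilde{x}) = \frac{1}{2A \sqrt{2\pi Q_n}} \exp\left[-\frac{(\tilde{x} - x)^2}{2 Q_n}\right],
\qquad
B_1(\tilde{x}) = \int_{-A}^{A} A_1(x, \tilde{x})\, dx.
\]

Substituting (4.10) into (4.7), we obtain $d_{mse}(i)$ as a function of $\tilde{x}$. However, this
relationship is not easy to express analytically and is obtained by numerical integration as shown
in Figure 4.2.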
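The numerical integration mentioned above can be sketched with a simple trapezoidal rule. The function name, the default values of $A$ and $Q_n$ and the number of integration steps are assumptions for illustration; the constant factors of $A_1$ cancel between numerator and denominator and are therefore omitted.

```python
import math

def d_mse_uniform(x_tilde, A=1.0, Qn=0.1, steps=2000):
    """Conditional mean E{x / x_tilde} for x uniform on [-A, A] observed in
    Gaussian noise of variance Qn, using trapezoidal integration of (4.7)."""
    h = 2.0 * A / steps
    num = 0.0
    den = 0.0
    for k in range(steps + 1):
        x = -A + k * h
        w = 0.5 if k in (0, steps) else 1.0            # trapezoid end weights
        like = math.exp(-(x_tilde - x) ** 2 / (2.0 * Qn))
        num += w * x * like
        den += w * like
    return num / den
```

Evaluating `d_mse_uniform` over a grid of `x_tilde` values gives a smooth saturating curve: nearly linear for small inputs and approaching, but never reaching, the limits at plus and minus `A`.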
Laplace Distribution


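The a priori density is given by

\[
P_x(x) = \frac{\lambda}{2} \exp[-\lambda |x|] \tag{4.11}
\]

and thus the a posteriori density takes the form

\[
P_{x/\tilde{x}}(x/\tilde{x}) = \frac{A_2(x, \tilde{x})}{B_2(\tilde{x})} \tag{4.12}
\]

where

\[
A_2(x, \tilde{x}) = \frac{\lambda}{2 \sqrt{2\pi Q_n}} \exp\left[-\lambda |x| - \frac{(\tilde{x} - x)^2}{2 Q_n}\right],
\qquad
B_2(\tilde{x}) = \int_{-\infty}^{+\infty} A_2(x, \tilde{x})\, dx.
\]

Combining (4.12) with (4.7) and using numerical integration we obtain $d_{mse}$ vs $\tilde{x}$ as
shown in Figure 4.3.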


4.1.2  Nonlinearities  with  MAP  Estimates 


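In this section we treat the estimation problem

\[
\tilde{x}(i) = x(i) + n(i)
\]

where $n(i)$ is Gaussian and $x(i)$ is i.i.d. non-Gaussian. However, we seek the MAP estimate of
$x(i)$, namely $d_{map}(i)$, when $n(i)$ is white or colored, or correlated with $x(i)$. The colored
noise case, as well as the case of noise correlated with $x(i)$, results in a memory nonlinear
relationship between $d_{map}$ and $\tilde{x}(i)$; i.e.,
$d_{map}(i) = g[\tilde{x}(i), \tilde{x}(i-1), \ldots, \tilde{x}(i-m)]$. If $x(i)$ is Gaussian i.i.d.
and $n(i)$ is white Gaussian, independent of $x(i)$, then $d_{map}(i)$ is identical to $d_{mse}(i)$
and is given by (4.8).

If we denote $\mathbf{x} = [x(i), x(i-1), \ldots, x(1)]$ and
$\tilde{\mathbf{x}} = [\tilde{x}(i), \tilde{x}(i-1), \ldots, \tilde{x}(1)]$, then the a posteriori
pdf is given by Van Trees [52, p. 58]

\[
P_{\mathbf{x}/\tilde{\mathbf{x}}}(\mathbf{x}/\tilde{\mathbf{x}}) =
\frac{P_{\mathbf{x}}(\mathbf{x}) \cdot P_{\tilde{\mathbf{x}}/\mathbf{x}}(\tilde{\mathbf{x}}/\mathbf{x})}
     {P_{\tilde{\mathbf{x}}}(\tilde{\mathbf{x}})}
\tag{4.13}
\]

and the MAP estimate, $\mathbf{d}_{map}$, of $\mathbf{x}$ given $\tilde{\mathbf{x}}$ is the value of
$\mathbf{x}$ which maximizes $\ell(\mathbf{x})$, where

\[
\ell(\mathbf{x}) = \ln P_{\tilde{\mathbf{x}}/\mathbf{x}}(\tilde{\mathbf{x}}/\mathbf{x}) + \ln P_{\mathbf{x}}(\mathbf{x}) \tag{4.14}
\]

and where the denominator of (4.13) does not contribute to the maximization of $\ell(\mathbf{x})$.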

CASE I: White Gaussian Noise

In this case $n(i)$ is white, Gaussian $N(0, Q_n)$, and independent of $x(i)$. It is also assumed
that $\{x(i)\}$ are i.i.d. and non-Gaussian. Consequently, joint pdfs are expressed as products of
marginal pdfs and the MAP estimate at each iteration (i), $d_{map}(i)$, is obtained by maximizing

\[
\ell(x(i)) = \ln P_{\tilde{x}/x}(\tilde{x}/x) + \ln P_x(x).
\]

That is to say, the estimation problem is decoupled and the resulting relationship
$d_{map}(i)$ vs $\tilde{x}(i)$ is memoryless.

The following memoryless nonlinearities can be derived.

(i) Uniform Distribution (4.9)

\[
d_{map}(i) =
\begin{cases}
-A, & \tilde{x}(i) < -A \\
\tilde{x}(i), & -A \le \tilde{x}(i) \le A \\
A, & \tilde{x}(i) > A
\end{cases}
\tag{4.15}
\]

Note that $d_{map}$ does not depend on $Q_n$.

(ii) Laplace Distribution (4.11)

\[
d_{map}(i) =
\begin{cases}
\tilde{x}(i) + \lambda Q_n, & \tilde{x}(i) < -\lambda Q_n \\
0, & -\lambda Q_n \le \tilde{x}(i) \le \lambda Q_n \\
\tilde{x}(i) - \lambda Q_n, & \tilde{x}(i) > \lambda Q_n.
\end{cases}
\tag{4.16}
\]

Here the MAP estimate depends on $Q_n$. For the symmetric uniform and Laplace a priori
distributions the resulting a posteriori pdf is asymmetric.

Figures 4.4 and 4.5 illustrate the MAP memoryless nonlinearities.
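Both memoryless MAP nonlinearities are one-line functions: (4.15) is a hard limiter and (4.16) is a dead-zone (soft-threshold) characteristic. The function and argument names below are assumptions chosen for illustration.

```python
def d_map_uniform(x_tilde, A):
    """MAP estimate (4.15) for a uniform prior on [-A, A]: a hard limiter."""
    return max(-A, min(A, x_tilde))

def d_map_laplace(x_tilde, lam, Qn):
    """MAP estimate (4.16) for a Laplace prior with parameter lam and
    white Gaussian noise of variance Qn: a soft threshold at lam * Qn."""
    threshold = lam * Qn
    if x_tilde > threshold:
        return x_tilde - threshold
    if x_tilde < -threshold:
        return x_tilde + threshold
    return 0.0
```

The limiter has no dependence on the noise variance, while the width of the dead zone in the Laplace case grows in proportion to it, in line with the remarks following (4.15) and (4.16).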

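CASE II: Colored Gaussian Noise

In this case we assume that $n(i)$ is colored Gaussian $N(\mathbf{0}, \mathbf{R})$, where $\mathbf{R}$
is the $m \times m$ correlation matrix. On the other hand, $\{x(i)\}$ are i.i.d. Based on these
assumptions, the numerator of (4.13) is

\[
P_{\mathbf{x}}(\mathbf{x}) \cdot P_{\tilde{\mathbf{x}}/\mathbf{x}}(\tilde{\mathbf{x}}/\mathbf{x})
= \prod_{l=1}^{m} P_x(x_l) \cdot P_{\tilde{\mathbf{x}}/\mathbf{x}}(\tilde{\mathbf{x}}/\mathbf{x})
\tag{4.17}
\]

where each factor of the product is the same marginal pdf $P_x(\cdot)$.

For mathematical tractability, we consider the case $m = 2$ and derive the memory nonlinear
relationships $d_{map}(i)$ vs $\tilde{\mathbf{x}}$.

For $m = 2$ the correlation matrix takes the form

\[
\mathbf{R} = Q_n \cdot
\begin{pmatrix}
1 & \rho \\
\rho & 1
\end{pmatrix},
\qquad |\rho| < 1.
\tag{4.18}
\]

For simplicity, we also define the following vectors

\[
\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \triangleq \begin{pmatrix} x(i) \\ x(i-1) \end{pmatrix},
\qquad
\tilde{\mathbf{x}} = \begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \end{pmatrix} \triangleq \begin{pmatrix} \tilde{x}(i) \\ \tilde{x}(i-1) \end{pmatrix}.
\tag{4.19}
\]

(i) Uniform Distribution (4.9)

Maximizing (4.17) is equivalent here to minimizing

\[
J = (\tilde{\mathbf{x}} - \mathbf{x})^T \mathbf{R}^{-1} (\tilde{\mathbf{x}} - \mathbf{x}) \tag{4.20}
\]

with the restrictions $-A \le x_1 \le A$, $-A \le x_2 \le A$. Hence, we seek a point in the area
$X_2 = \{(x_1, x_2) : -A \le x_1 \le A,\ -A \le x_2 \le A\}$ such that $J$ is minimized.
Differentiating $J$ with respect to $x_1$ and $x_2$ and setting the derivative to zero we obtain

\[
\begin{aligned}
(\tilde{x}_1 - x_1) - \rho (\tilde{x}_2 - x_2) &= 0 \\
(\tilde{x}_2 - x_2) - \rho (\tilde{x}_1 - x_1) &= 0.
\end{aligned}
\tag{4.21}
\]

From (4.21), it is apparent that if $\tilde{\mathbf{x}} \in X_2$, that is, $-A \le \tilde{x}_1 \le A$
and $-A \le \tilde{x}_2 \le A$, then

\[
\mathbf{d}_{map} = \begin{pmatrix} d_{1map} \\ d_{2map} \end{pmatrix}
= \begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \end{pmatrix}
\quad \text{for } \tilde{\mathbf{x}} \in X_2.
\tag{4.22}
\]

When $\tilde{\mathbf{x}}$ is outside $X_2$, the minimum is achieved on the boundary of $X_2$. That is

\[
\begin{aligned}
d_{1map} &= \lambda \cdot A \cdot \mathrm{sgn}[\tilde{x}_1] + (1 - \lambda)\, f_c[\tilde{x}_1 - \rho(\tilde{x}_2 - A\, \mathrm{sgn}[\tilde{x}_2])] \\
d_{2map} &= (1 - \lambda) \cdot A \cdot \mathrm{sgn}[\tilde{x}_2] + \lambda \cdot f_c[\tilde{x}_2 - \rho(\tilde{x}_1 - A\, \mathrm{sgn}[\tilde{x}_1])]
\end{aligned}
\quad \text{for } \tilde{\mathbf{x}} \notin X_2
\tag{4.23}
\]

where

\[
f_c(x) =
\begin{cases}
A, & x > A \\
x, & |x| \le A \\
-A, & x < -A.
\end{cases}
\tag{4.24}
\]

(ii) Laplace Distribution (4.11)

Obtaining the MAP estimate is equivalent to minimizing

\[
J = \lambda |x_1| + \lambda |x_2| + \frac{1}{2} \left[ (\tilde{\mathbf{x}} - \mathbf{x})^T \mathbf{R}^{-1} (\tilde{\mathbf{x}} - \mathbf{x}) \right].
\tag{4.25}
\]

The necessary conditions are

\[
\begin{aligned}
\lambda\, \mathrm{sgn}[x_1] + c (x_1 - \tilde{x}_1) - c \rho (x_2 - \tilde{x}_2) &= 0 \\
\lambda\, \mathrm{sgn}[x_2] + c (x_2 - \tilde{x}_2) - c \rho (x_1 - \tilde{x}_1) &= 0
\end{aligned}
\tag{4.26}
\]

where $c = \dfrac{1}{Q_n (1 - \rho^2)}$. Clearly, (4.26) is a nonlinear system of equations. Two
special cases are the following: 1) when $-\lambda/c \le \tilde{x}_1 - \rho \tilde{x}_2,\ 
\tilde{x}_2 - \rho \tilde{x}_1 \le \lambda/c$, then $\mathbf{d}_{map} = \mathbf{0}$, and 2)
when $\rho = 0$, the problem reduces to the case of white Gaussian noise.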

4.2  The  Bussgang  Algorithms 

Fig. 4.1 illustrates the Bussgang adaptive blind equalization algorithms when an LMS-type or
stochastic gradient algorithm [53] is used for the adaptation of the equalizer coefficients, and the
nonlinearity $g^{(i)}[\cdot]$ is memoryless [3], [4], [23]. The following equations, consistent with
the block diagram of Fig. 4.1, describe the Bussgang family of algorithms:

\[
\begin{aligned}
\mathbf{u}(i) &= [u_1(i), \ldots, u_N(i)]^T && \text{equalizer taps} \\
\mathbf{u}(0) &= [0, \ldots, 1, \ldots, 0]^T && \text{initial tap values} \\
\mathbf{y}(i) &= [y(i), \ldots, y(i - N + 1)]^T && \text{input to the equalizer (block of data)} \\
i &= 0, 1, 2, \ldots && \text{iteration index} \\
\tilde{x}(i) &= \mathbf{u}^H(i)\, \mathbf{y}(i) && \text{equalizer output or reconstructed sequence} \\
d(i) &= g^{(i)}[\tilde{x}(i)] = g^{(i)}[\mathbf{u}^H(i)\, \mathbf{y}(i)] && \text{output of nonlinearity} \\
e(i) &= d(i) - \tilde{x}(i) && \text{error sequence} \\
\mathbf{u}(i + 1) &= \mathbf{u}(i) + \mu\, \mathbf{y}(i) \cdot e^{*}(i) && \text{LMS-type adaptation}
\end{aligned}
\tag{4.27}
\]
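The recursion (4.27) can be sketched for real-valued data with the memoryless nonlinearity $g(\tilde{x}) = \gamma\,\mathrm{sgn}(\tilde{x})$. Everything specific in this sketch is an assumption made for illustration: binary symbols, a mild two-tap channel, seven equalizer taps, the step size, and $\gamma = 1$. For real data the conjugate in the update has no effect.

```python
import random

def sgn(v):
    return 1.0 if v >= 0 else -1.0

def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for k, bk in enumerate(b):
            out[i + k] += ai * bk
    return out

def residual_isi(f, u):
    """Residual ISI of the combined response f * u."""
    power = [abs(v) ** 2 for v in convolve(f, u)]
    return (sum(power) - max(power)) / max(power)

def bussgang_sign(y, n_taps, mu, gamma):
    """LMS-type Bussgang iteration for real data with g(x) = gamma * sgn(x)."""
    u = [0.0] * n_taps
    u[n_taps // 2] = 1.0                           # centre-spike initial taps
    for i in range(n_taps - 1, len(y)):
        block = [y[i - k] for k in range(n_taps)]  # y(i), ..., y(i - N + 1)
        x_tilde = sum(uk * yk for uk, yk in zip(u, block))  # equalizer output
        d = gamma * sgn(x_tilde)                   # output of the nonlinearity
        e = d - x_tilde                            # error sequence
        u = [uk + mu * yk * e for uk, yk in zip(u, block)]  # tap update
    return u

random.seed(0)
x = [1.0 if random.random() < 0.5 else -1.0 for _ in range(5000)]
f = [1.0, 0.3]                                     # hypothetical channel
y = convolve(f, x)[:len(x)]                        # noiseless received data

u = bussgang_sign(y, n_taps=7, mu=0.01, gamma=1.0)
isi_before = residual_isi(f, [1.0])
isi_after = residual_isi(f, u)
```

No training sequence appears anywhere in the loop: the only reference signal is the nonlinearity applied to the equalizer's own output. With this easy channel the residual ISI after adaptation falls well below its initial value.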


4.2.1  Convergence  Rate  and  Properties 


From (4.27) and Figure 4.1, it is apparent that the output sequence of the nonlinear function,
$d(i)$, “plays the role” of the desired response or the training sequence. It is also
apparent that the Bussgang technique is simple to implement and understand, and it may be
viewed as a minor modification of the original LMS algorithm (the desired response of the LMS
adaptation is here a memoryless transformation of the transversal filter output). As such, it is
expected that the convergence of the technique will depend on the eigenvalue spread of
the autocorrelation matrix of the received data $\{y(i)\}$.

From (4.27), the LMS adaptation equation for the equalizer coefficients is given by

\[
\mathbf{u}(i + 1) = \mathbf{u}(i) + \mu\, \mathbf{y}(i)\, e^{*}(i) \tag{4.28}
\]

If we take the expected value (ensemble averaging) of (4.28), we have

\[
\begin{aligned}
E\{\mathbf{u}(i + 1)\} &= E\{\mathbf{u}(i)\} + \mu E\{\mathbf{y}(i) (g^{(i)*}[\tilde{x}(i)] - \tilde{x}^{*}(i))\} \\
&= E\{\mathbf{u}(i)\} + \mu E\{\mathbf{y}(i)\, g^{(i)*}[\tilde{x}(i)]\} - \mu E\{\mathbf{y}(i)\, \tilde{x}^{*}(i)\}.
\end{aligned}
\tag{4.29}
\]

The adaptive algorithm converges in the mean when

\[
E\{\mathbf{y}(i)\, g^{(i)*}[\tilde{x}(i)]\} = E\{\mathbf{y}(i)\, \tilde{x}^{*}(i)\} \quad \text{(equilibrium)}
\]

and it converges in the mean-square when

\[
E\{\tilde{x}(i)\, g^{(i)*}[\tilde{x}(i)]\} = E\{\tilde{x}(i)\, \tilde{x}^{*}(i)\}. \tag{4.30}
\]

Thus, it is required that the equalizer output $\tilde{x}(i)$ be Bussgang at equilibrium.
Note that identity (4.30) states that the autocorrelation of $\tilde{x}(i)$ (right-hand side) equals
the cross correlation between $\tilde{x}(i)$ and a nonlinear transformation of $\tilde{x}(i)$
(left-hand side). Processes which satisfy property (4.30) are said to be Bussgang [10]. In summary,
the adaptive Bussgang techniques converge when the equalizer output sequence,
$\{\tilde{x}(i)\}$, becomes Bussgang (necessary condition).

A stochastic gradient algorithm (steepest descent) essentially minimizes iteratively a
performance index $J(i) = E\{G[\tilde{x}(i)]\}$ with respect to the equalizer coefficients
$\mathbf{u}(i)$. A more general form of the equalizer taps adaptation equation (4.28) is [25]

\[
\mathbf{u}(i + 1) = \mathbf{u}(i) - \mu \nabla_{\mathbf{u}} J(i) \tag{4.31}
\]

where $\nabla_{\mathbf{u}} J(i)$ is the gradient of $J(i)$. Differentiating $J(i)$ by using the
composite function rule, we obtain

\[
\begin{aligned}
\nabla_{\mathbf{u}} J(i) &= E\{\nabla_{\mathbf{u}}[\tilde{x}(i)] \cdot \nabla_{\tilde{x}}[G(\tilde{x}(i))]\} \\
&= E\{\mathbf{y}(i) \cdot \nabla_{\tilde{x}}[G(\tilde{x}(i))]\}
\end{aligned}
\tag{4.32}
\]

By dropping the expectation operation, i.e., by using a single-point unbiased estimate,
we obtain

\[
\nabla_{\mathbf{u}} J(i) \approx -\mathbf{y}(i)\, e^{*}(i) \tag{4.33}
\]

where

\[
\begin{aligned}
e^{*}(i) &= -\nabla_{\tilde{x}}[G(\tilde{x}(i))] \\
&= g^{(i)*}[\tilde{x}(i)] - \tilde{x}^{*}(i)
\end{aligned}
\tag{4.34}
\]

Equation (4.34) shows the relationship between the nonlinear function $g^{(i)}[\cdot]$ used in the
Bussgang techniques and the nonlinear cost function $G[\cdot]$ which defines the performance
index $J[\cdot]$.
Example for one-dimensional modulation (PAM)

The first blind equalization algorithm was introduced by Sato in 1975 [47] for PAM signals. He
chose the simple nonlinear function

\[
g(\tilde{x}) = \gamma\, \mathrm{sgn}[\tilde{x}] \tag{4.35}
\]

where $\gamma$ is a gain parameter which must be chosen to satisfy the Bussgang property (4.30)

\[
E\{\tilde{x}(i) \cdot \gamma\, \mathrm{sgn}[\tilde{x}(i)]\} = E\{|\tilde{x}(i)|^2\}
\]

or

\[
\gamma = E\{|\tilde{x}(i)|^2\} / E\{|\tilde{x}(i)|\}. \tag{4.36}
\]

We could also write Sato’s algorithm in terms of

\[
G(\tilde{x}) = \frac{1}{2} \tilde{x}^2 - \gamma |\tilde{x}| \tag{4.37}
\]
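As a worked example of (4.36), the gain can be evaluated directly on a symbol alphabet. This assumes perfect equalization, so that the equalizer output takes the values of the transmitted symbols, and it uses a hypothetical equiprobable 4-PAM alphabet; neither assumption comes from the paper.

```python
def sgn(v):
    return 1.0 if v >= 0 else -1.0

def sato_gain(symbols):
    """gamma = E{|x|^2} / E{|x|} for equiprobable real symbols."""
    second_moment = sum(s * s for s in symbols) / len(symbols)
    first_abs_moment = sum(abs(s) for s in symbols) / len(symbols)
    return second_moment / first_abs_moment

pam4 = [-3.0, -1.0, 1.0, 3.0]        # hypothetical 4-PAM alphabet
gamma = sato_gain(pam4)

# Both sides of the Bussgang property for g(x) = gamma * sgn(x).
lhs = sum(s * gamma * sgn(s) for s in pam4) / len(pam4)
rhs = sum(s * s for s in pam4) / len(pam4)
```

For this alphabet the second moment is 5 and the mean absolute value is 2, so `gamma` is 2.5 and both sides of the property equal 5.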

4.2.2  Extension  to  QAM  modulation 

The extension of Bussgang algorithms to two-dimensional constellations (QAM) is fairly
straightforward [3], [4]. In the case of two independent quadrature carriers, the conditional
mean estimate of an equivalent complex transmitted symbol $x$ given the complex observation
$\tilde{x} = \tilde{x}_R + j \tilde{x}_I$ can be written as

\[
d = E\{x/\tilde{x}\} = g[\tilde{x}_R] + j\, g[\tilde{x}_I]. \tag{4.38}
\]

We keep the notation simple by omitting (i). For example, the Sato nonlinearity for QAM signals
takes the form [47]

\[
g(\tilde{x}) = \gamma\, \mathrm{csgn}(\tilde{x}) = \gamma \{\mathrm{sgn}[\tilde{x}_R] + j\, \mathrm{sgn}[\tilde{x}_I]\}. \tag{4.39}
\]

It is clear that the real and imaginary parts of the data can be estimated separately. The complex
data equivalent of the adaptive Bussgang techniques is described in (4.27), but with

\[
g^{(i)}[\tilde{x}(i)] = g^{(i)}[\tilde{x}_R(i)] + j\, g^{(i)}[\tilde{x}_I(i)]. \tag{4.40}
\]

Consequently, the error sequence is

\[
e(i) = \{g^{(i)}[\tilde{x}_R(i)] - \tilde{x}_R(i)\} + j \{g^{(i)}[\tilde{x}_I(i)] - \tilde{x}_I(i)\}. \tag{4.41}
\]

For example, the “Stop-and-Go” algorithm introduced by Picchi and Prati [41] is an adaptive
Bussgang technique with the following nonlinearity

\[
\begin{aligned}
g[\tilde{x}(i)] = \tilde{x}(i) &+ \frac{1}{2} A\, \hat{x}(i) - \frac{1}{2} A\, \tilde{x}(i) \\
&+ \frac{1}{2} B\, \hat{x}^{*}(i) - \frac{1}{2} B\, \tilde{x}^{*}(i)
\end{aligned}
\tag{4.42}
\]

where $\hat{x}(i)$ is defined as the quantizer (slicer) output in Figure 4.1 and $(A, B)$ is a pair
of integers taking values (2,0) or (1,1) or (1,-1) or (0,0). The values of $(A, B)$ are generally
different at each iteration, and how they are chosen is described later in this section.

Another  example  of  a  Bussgang  technique  is  the  heuristic  modification  of  the  Sato  algo¬ 
rithm  suggested  by  Benveniste  and  Goursat  [5],  [6].  In  this  case,  the  nonlinear  function  takes  the 
form 

5t[x(j)]  =  x(i)  +  —  k\x(i)  + 

k2\x(i)  -  x(?')J  •  [7csgn[i(i)]  -  x(t)] 

or 

ff[x(i)]  =  x(f)  +  |x(t)  -  x(i)|  {fc1e.iarg(r(,)-*(*)]q. 

fc2[7csgn[f(i)]  -  x(i)]}  (4.43) 

where  &2,  are  constants.  From  (4.38)  we  observe  that  the  Benveniste-Goursat  error  function 
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may  bo  seen  as  a  weighted  sum  of  the  Decision  Directed  (DD)  [-43]  and  Sato  errors.  On  the 
other  hand  the  "Stop-and-Go"  error  function  (4.37)  is  the  weighted  sum  of  the  DD  error  and 
its  conjugate.  The  weights  of  the  two  algorithms,  however,  are  chosen  in  a  completely  different 
manner. 

4.2.3  Unknown  Carrier  Phase:  The  Constant  Modulus  Property 

Equation (4.38) can be written in polar coordinates as

$$ d = E\{x \mid \tilde{x}\} = r\,e^{j\theta}. \tag{4.44} $$

If we assume that all rotated constellations are equally likely, since the carrier phase is unknown, then the conditional mean $d$ in (4.44) has the same argument as $\tilde{x}$, and is given by

$$ d = g[|\tilde{x}|] \cdot e^{j\arg(\tilde{x})} \tag{4.45} $$

where $g[\cdot]$ is a nonlinear function and $|\tilde{x}| = \sqrt{\tilde{x}_R^2 + \tilde{x}_I^2}$, $\arg(\tilde{x}) = \arctan[\tilde{x}_I/\tilde{x}_R]$. Combining (4.44) with (4.45) we obtain [3], [4], [23]

$$ e(i) = d(i) - \tilde{x}(i) = g[|\tilde{x}(i)|]\,e^{j\arg[\tilde{x}(i)]} - \tilde{x}(i) = \tilde{x}(i)\left[\frac{g[|\tilde{x}(i)|]}{|\tilde{x}(i)|} - 1\right]. \tag{4.46} $$

Hence, the error term is independent of any fixed phase rotation of the signal constellation. Equation (4.27) also represents the Bussgang technique for the case of unknown carrier phase, provided we substitute $e(i)$ in (4.27) by $e(i)$ of (4.46).

Example: The Godard (or CMA) Algorithm [22], [50]

Under the assumption that all rotated constellations are equally likely, Godard [22] suggested that $g[|\tilde{x}|]$ in (4.46) be chosen as

$$ g[|\tilde{x}|] = |\tilde{x}| + R_p|\tilde{x}|^{p-1} - |\tilde{x}|^{2p-1} \tag{4.47} $$

where $R_p$ is a real constant. As we shall see this form has some very nice properties. Special cases of (4.47) include

$$ g[|\tilde{x}|] = (1 + R_2)|\tilde{x}| - |\tilde{x}|^3 \quad (p = 2) $$

and

$$ g[|\tilde{x}|] = R_1 \quad (p = 1). $$

The parameter $R_p$ is a gain constant which has to be chosen according to (4.30). Since

$$ g[\tilde{x}(i)] = \frac{\tilde{x}(i)\,g[|\tilde{x}(i)|]}{|\tilde{x}(i)|} \tag{4.48} $$

combining (4.48) with (4.30), we obtain

$$ E\{|\tilde{x}(i)|^2 + R_p|\tilde{x}(i)|^p - |\tilde{x}(i)|^{2p}\} = E\{|\tilde{x}(i)|^2\} $$

or

$$ R_p = \frac{E\{|\tilde{x}(i)|^{2p}\}}{E\{|\tilde{x}(i)|^p\}}. \tag{4.49} $$

At perfect equalization, $\tilde{x}(i) = x(i)e^{j\theta}$ (assuming time delay $D = 0$), and thus

$$ R_p = \frac{m_{2p}}{m_p} \quad \text{where } m_p = E\{|x(i)|^p\}. $$

Combining (4.34) and (4.48), we obtain the Godard performance index nonlinearity, namely,

$$ G(\tilde{x}(i)) = \frac{1}{2p}\left(|\tilde{x}(i)|^p - R_p\right)^2 \tag{4.50} $$

Fig.  4.6  summarizes  the  nonlinear  functions  of  the  Bussgang  iterative  techniques. 
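To make the Godard quantities concrete, the following Python sketch computes $R_p$ for an equiprobable constellation and evaluates the error term obtained by substituting the Godard nonlinearity into the polar-form error. The function names and the 4-QAM constellation are assumptions introduced for this example.

```python
def godard_radius(constellation, p=2):
    """R_p = E{|x|^(2p)} / E{|x|^p} for equiprobable symbols (perfect equalization assumed)."""
    num = sum(abs(s) ** (2 * p) for s in constellation) / len(constellation)
    den = sum(abs(s) ** p for s in constellation) / len(constellation)
    return num / den


def godard_error(x_tilde, r_p, p=2):
    """e = x~ * (g[|x~|] / |x~| - 1) with g[|x~|] = |x~| + R_p |x~|^(p-1) - |x~|^(2p-1)."""
    m = abs(x_tilde)
    if m == 0.0:
        return 0j                       # the error is taken as zero at the origin
    g = m + r_p * m ** (p - 1) - m ** (2 * p - 1)
    return x_tilde * (g / m - 1.0)


qam4 = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]   # assumed 4-QAM constellation
r2 = godard_radius(qam4)                    # E{|x|^4} / E{|x|^2} = 4 / 2
```

Because the correction factor depends only on the modulus of the equalizer output, rotating the output by a fixed phase rotates the error by the same phase, which is what makes the adaptation insensitive to the unknown carrier phase.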

4.2.4  The  Sato  and  Benveniste-Goursat  Algorithms 

Sato [46] introduced the first blind equalization scheme in 1975 by introducing the sign nonlinearity to generate the desired response of the adaptive scheme shown in Figure 4.1, i.e., $d(i) = \gamma\,\mathrm{sgn}[\tilde{x}(i)]$. In 1986, Sato [47] extended his 1-D PAM algorithm to the multidimensional blind equalization problem where all transmitted signals become vector processes and all impulse responses (channel and equalizer) are square matrices. The extension, however, is straightforward. For example, in the two-dimensional case of QAM signals the "sign" nonlinearity becomes the "complex sign" defined by (4.39). The error signal of the Sato algorithm

$$ e_S(i) = \gamma\,\mathrm{csgn}[\tilde{x}(i)] - \tilde{x}(i) \tag{4.51} $$

is very noisy around the solution unless the transmitted sequence $x(i)$ takes only the values $\pm 1$. In other words, although $e_S(i)$ is zero-mean at the solution, it has a large variance. On the other hand, the Decision Directed (DD) error signal $e_D(i) = \hat{x}(i) - \tilde{x}(i)$ (see Figure 4.6) [33], though not robust for blind equalizers, enjoys the property of being identically zero at the solution. Hence, Benveniste-Goursat [5] suggested the idea of combining (heuristically) both error signals in the form of a weighted averaging as follows

$$ e_{BG}(i) = k_1 e_D(i) + k_2 e_S(i)\,|e_D(i)| \tag{4.52} $$

where $k_1$, $k_2$ are constants. The rationale behind the error expression (4.52) is the following.

Before the eye of the equalizer opens, $|e_D(i)|$ is large and thus the Sato error $e_S(i)$ contributes to the proper direction. At the opening of the eye and thereafter $|e_D(i)|$ becomes small and the DD mode of the error $e_{BG}(i)$ takes over to speed up convergence and to achieve a faster rate than the original Sato algorithm with $e_S(i)$. It is no wonder, therefore, that in our simulation experience we have seen the Benveniste-Goursat (BG) algorithm exhibiting initially very slow convergence.

A faster convergence rate has been observed only after the eye opens. The Benveniste-Goursat algorithm may be seen as the Sato algorithm that switches automatically to a DD one when the eye of the equalizer opens. The extension of the Benveniste-Goursat algorithm to a Decision Feedback Equalization (DFE) implementation [2] was given by Macchi et al. [32].
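The weighted combination of the DD and Sato errors can be sketched as follows. This is a minimal Python illustration, assuming a square QAM constellation whose rails are sliced independently; the slicer, the level set, and the values of the gain and of $k_1$, $k_2$ are hypothetical choices, not values taken from [5].

```python
def sgn(v):
    """Sign function returning -1, 0 or +1."""
    return (v > 0) - (v < 0)


def csgn(x):
    """Complex sign applied to each rail separately."""
    return complex(sgn(x.real), sgn(x.imag))


def slicer(x_tilde, levels):
    """Nearest-level decision taken independently on each rail (assumed square QAM)."""
    def nearest(v):
        return min(levels, key=lambda a: abs(a - v))
    return complex(nearest(x_tilde.real), nearest(x_tilde.imag))


def bg_error(x_tilde, gamma, k1, k2, levels):
    """Benveniste-Goursat error: k1 * e_D + k2 * e_S * |e_D|."""
    e_d = slicer(x_tilde, levels) - x_tilde     # decision-directed error
    e_s = gamma * csgn(x_tilde) - x_tilde       # Sato error
    return k1 * e_d + k2 * e_s * abs(e_d)


levels = (-3.0, -1.0, 1.0, 3.0)                 # assumed 16-QAM rail levels
on_point = bg_error(3 + 1j, 2.5, 1.0, 0.5, levels)
off_point = bg_error(0.5 + 0.5j, 2.5, 1.0, 0.5, levels)
```

When the equalizer output sits exactly on a constellation point the DD error vanishes, and with it the whole combined error, even though the Sato error alone would be nonzero there.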

4.2.5 The Godard and Donoho (or Shalvi-Weinstein) Algorithms

The basic motivation behind the development of Godard's algorithm, introduced in 1980 [22], was to find a cost function that characterizes the amount of ISI at the equalizer output independently of the carrier phase. Since the input sequence $x(i)$ is i.i.d., the cost function that satisfies the aforementioned conditions is

$$ J^{(p)} = E\{(|\tilde{x}(i)|^p - |x(i)|^p)^q\} \tag{4.53} $$

which depends on the input sequence. For $p = 2$ and $q = 2$, $J^{(p)}$ takes the form

$$ J^{(2)} = E\{|\tilde{x}(i)|^4 + |x(i)|^4 - 2|\tilde{x}(i)|^2|x(i)|^2\} \tag{4.54} $$

where we assume that $E\{x^2(i)\} = 0$. However, (4.53) or (4.54) cannot be used in practice because $\{x(i)\}$ is inaccessible. To avoid this difficulty, Godard [22] suggested the use of a dispersion function

$$ D^{(p)} = E\{(|\tilde{x}(i)|^p - R_p)^q\} \tag{4.55} $$

which was shown to behave like the cost function $J^{(p)}$ and yet it is independent of the input sequence. Note that $R_p$ is defined by (4.49). Assuming $p = 2$, $q = 2$, (4.54) and (4.55) can be written as [22]

$$ J^{(2)} = J_1 + J_2 + \{4(E\{|x(i)|^2\})^2 \cdot |f(0)|^2 - 2(E\{|x(i)|^2\})^2\} \cdot \sum_k{}' |f(k)|^2 \tag{4.56} $$

and

$$ D^{(2)} = J_1 + J_2 + \{4(E\{|x(i)|^2\})^2 \cdot |f(0)|^2 - 2E\{|x(i)|^4\}\} \cdot \sum_k{}' |f(k)|^2 + R_2^2 - E\{|x(i)|^4\} \tag{4.57} $$


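where $\sum_k'$ is taken for $k \neq 0$ and

$$ J_1 = E\{|x(i)|^4\}\,(1 - |f(0)|^2)^2, \qquad J_2 = E\{|x(i)|^4\} \sum_k{}' |f(k)|^4 + 2(E\{|x(i)|^2\})^2 \sum_k{}' \sum_{l \neq k}{}' |f(k)|^2 |f(l)|^2. \tag{4.58} $$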


Comparing (4.56) with (4.57), we see that for $D^{(2)}$ to be similar to $J^{(2)}$, the following inequality must be satisfied:

$$ 4(E\{|x(i)|^2\})^2\,|f(0)|^2 - 2E\{|x(i)|^4\} > 0 $$

or

$$ |f(0)|^2 > \frac{E\{|x(i)|^4\}}{2(E\{|x(i)|^2\})^2}. \tag{4.59} $$

Godard suggests (4.59) and $f(i) = 0$ for $i \neq 0$ as a way of initializing his algorithm.
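The right-hand side of the initialization inequality depends only on the second and fourth moments of the constellation. The following Python sketch evaluates it for two equiprobable constellations; the function name and the constellations are assumptions made for illustration.

```python
def godard_center_tap_bound(constellation):
    """Lower bound on |f(0)|^2: E{|x|^4} / (2 * (E{|x|^2})^2) for equiprobable symbols."""
    n = len(constellation)
    m2 = sum(abs(s) ** 2 for s in constellation) / n
    m4 = sum(abs(s) ** 4 for s in constellation) / n
    return m4 / (2.0 * m2 * m2)


rail = (-3, -1, 1, 3)
qam16 = [complex(a, b) for a in rail for b in rail]   # assumed 16-QAM constellation
qam4 = [complex(a, b) for a in (-1, 1) for b in (-1, 1)]

bound16 = godard_center_tap_bound(qam16)   # 132 / (2 * 10^2)
bound4 = godard_center_tap_bound(qam4)     # 4 / (2 * 2^2)
```

For these assumed constellations the bound is 0.66 for 16-QAM and 0.5 for 4-QAM, so a center tap whose squared magnitude exceeds the bound, with all other taps zero, satisfies the condition.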

Based on what has been reported in the literature [50] and on our simulation experience, the Godard algorithm has always converged to a minimum that opens the eye when Godard's initialization procedure is followed. The Godard algorithm is summarized in (4.27) and Fig. 4.6. Its convergence for $p = 2$ is better than for $p = 1$. In addition, Godard noted that convergence improves when the step size $\mu$ is divided by 2 every 10,000 iterations [22]. The Constant Modulus Algorithm (CMA), suggested independently by Treichler and Agee in 1983 [50], is the Godard algorithm for $p = 2$ and $R_2 = 1$. Ding et al. [15] reported that the Godard-type algorithms exhibit local (not global) undesirable minima.

Shalvi and Weinstein recently introduced [48] a blind equalization scheme based on the idea of matching the kurtosis measures between the transmitted sequence $\{x(i)\}$ and the reconstructed sequence $\{\tilde{x}(i)\}$ at the output of the equalizer. The kurtosis of the input complex sequence $x(i)$ is defined by

$$ K(x(i)) = E\{|x(i)|^4\} - 2E^2\{|x(i)|^2\} - |E\{x^2(i)\}|^2 \tag{4.60} $$

which is zero for complex Gaussian random variables. The important point made in [48] is that if $E\{|\tilde{x}(i)|^2\} = E\{|x(i)|^2\}$, then (1) $|K(\tilde{x}(i))| \leq |K(x(i))|$ and (2) $|K(\tilde{x}(i))| = |K(x(i))|$ if perfect equalization is achieved. Thus, the problem is to maximize the magnitude of the kurtosis measure $|K(\tilde{x}(i))|$ in the output of the equalizer at each iteration subject to the constraint $E\{|\tilde{x}(i)|^2\} = E\{|x(i)|^2\}$. One of the special cases of the Shalvi-Weinstein algorithm is the original Godard algorithm. It has recently been reported that the Shalvi-Weinstein algorithm was originally introduced by Donoho [16] for real-valued signals and that the algorithm's convergence is only guaranteed for infinite-length equalization filters.
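The kurtosis inequality can be checked numerically on a small example. The Python sketch below, with hypothetical names, computes the kurtosis measure over an equiprobable population and compares a 4-QAM input with the output of an assumed two-tap combined response scaled to preserve the power.

```python
def complex_kurtosis(samples):
    """K = E{|x|^4} - 2 (E{|x|^2})^2 - |E{x^2}|^2 over an equiprobable population."""
    n = len(samples)
    m4 = sum(abs(s) ** 4 for s in samples) / n
    m2 = sum(abs(s) ** 2 for s in samples) / n
    c2 = sum(s * s for s in samples) / n
    return m4 - 2.0 * m2 ** 2 - abs(c2) ** 2


qam4 = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]     # assumed input constellation

# Output of an assumed combined response with two equal taps, (x1 + x2) / sqrt(2),
# enumerated over all pairs of independent input symbols. The power is unchanged.
mixed = [(a + b) / 2 ** 0.5 for a in qam4 for b in qam4]

k_in = complex_kurtosis(qam4)
k_out = complex_kurtosis(mixed)
```

In this example the input kurtosis is -4 while the unequalized output has kurtosis -2 at the same power, so residual intersymbol interference reduces the kurtosis magnitude, in agreement with property (1).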

4.2.6  The  Stop-and-Go  and  Decision-Directed  Algorithms 

The basic idea behind the Stop-and-Go algorithm, which was proposed by Picchi and Prati [41] in 1987, is to retain the advantages of simplicity and fast convergence (in open eye-pattern conditions) of the Decision Directed (DD) algorithm [33] while attempting to improve its blind convergence capabilities.

The adaptation error $e_D(i)$ used in the DD algorithm is [33]

$$ e_D(i) = \hat{x}(i) - \tilde{x}(i) \tag{4.61} $$

where $\tilde{x}(i)$ is the output of the equalizer and $\hat{x}(i)$ the output of the threshold detector. Assuming that the equalizer initial tap setting corresponds to a closed eye-pattern, $e_D(i)$ will be large most of the time due to the large number of incorrect decisions $\hat{x}(i)$. Consequently, the DD algorithm cannot converge in closed eye-pattern conditions.

In the Stop-and-Go algorithm, Picchi and Prati proposed the use of the error sequence

$$ e(i) = \tfrac{1}{2}\{A(i)\,e_D(i) + B(i)\,e_D^*(i)\} \tag{4.62} $$

where

$$ A(i) = I_R(i) + I_I(i), \qquad B(i) = I_R(i) - I_I(i) $$

and

$$ I_R(i) = \begin{cases} 1, & \text{if } \mathrm{sgn}[e_D(i)]_R = \mathrm{sgn}[e_S(i)]_R \\ 0, & \text{otherwise} \end{cases} \qquad I_I(i) = \begin{cases} 1, & \text{if } \mathrm{sgn}[e_D(i)]_I = \mathrm{sgn}[e_S(i)]_I \\ 0, & \text{otherwise.} \end{cases} $$

Note that $e_S(i)$ is the Sato error given by (4.51).

From the foregoing, it is clear that the Stop-and-Go algorithm is essentially the DD algorithm when the eye is open. It is mostly during closed eye-pattern conditions that the Stop-and-Go adaptation rule takes place. Also, it is clear that the Benveniste-Goursat and Stop-and-Go algorithms have different convergence properties when the eye-pattern is closed and similar convergence properties when the eye is open. Modifications of this algorithm have been proposed to incorporate joint equalization and carrier recovery, decision feedback equalization [1], as well as fractionally spaced equalization [21], [45].
4.3 The CRIMNO Algorithm


Although the Bussgang algorithms are different from each other, as we have seen, they perform only memoryless nonlinear transformations on the equalizer outputs to generate the desired response. This, in turn, implies that the cost functions they attempt to minimize with respect to the equalizer coefficients are also memoryless. These algorithms do not explicitly employ the fact that the transmitted data are statistically independent, which is the essence of the new criterion we introduce in this section. Since statistical independence of the transmitted data involves more than one data symbol, this results in a memory nonlinear transformation on the equalizer outputs and thus a memory nonlinear cost function.

4.3.1  Criterion  with  Memory  Nonlinearity 

As we have seen, Godard solves the blind equalization problem by proposing a cost function which is independent of the transmitted data, and yet reaches its global minimum at perfect equalization. The Godard cost function (also known as the constant modulus algorithm (CMA)) [22] is given by (4.55) and (4.49).

Note that only the expected value of some function of the current equalizer output appears in Godard's cost function. Therefore, the Godard criterion only makes use of the probability distribution of the transmitted data. It does not explicitly use the fact that the transmitted data are statistically independent.

Assume that perfect equalization is achievable and consider the situation where perfect equalization has indeed been achieved. That is

$$ \tilde{x}(i) = x(i - D) $$

where $D$ is some positive number, which accounts for the delay. Since the transmitted data $x(i)$ are statistically independent from each other, so are the equalizer outputs $\tilde{x}(i)$ at perfect equalization. In addition, for most transmitted data constellations, the mean of transmitted data $x(i)$ is zero. Therefore, at perfect equalization, we have for $l \neq 0$

$$ E\{\tilde{x}(i)\tilde{x}^*(i - l)\} = E\{x(i - D)x^*(i - l - D)\} = E\{x(i - D)\} \cdot E\{x^*(i - l - D)\} = 0. $$

By making use of this property and combining it with Godard's criterion, we obtain a new criterion, called criterion with memory nonlinearity (CRIMNO), which is the minimization of the following cost function:

$$ M^{(p)} = w_0 E(|\tilde{x}(i)|^p - R_p)^2 + w_1|E\{\tilde{x}(i)\tilde{x}^*(i-1)\}|^2 + \cdots + w_M|E\{\tilde{x}(i)\tilde{x}^*(i-M)\}|^2. \tag{4.63} $$

The rationale behind the CRIMNO is that since each term reaches its global minimum at perfect equalization, by appropriately combining them, we can increase the convergence speed of the corresponding CRIMNO algorithm [12], [13]. This is clearly demonstrated in the simulations section.

Remarks: 

1. Memory nonlinearity: the CRIMNO cost function depends not only on the current equalizer output, but also on the previous equalizer outputs. As such, it results in a criterion with memory nonlinearity. The parameter $M$ determines the size of memory.

2. Generalization of the Godard criterion: when $w_0 = 1$, $w_i = 0$ for $i \neq 0$, the CRIMNO cost function reduces to the Godard cost function. Therefore, the CRIMNO criterion may be seen as a generalization of the Godard criterion.

3. Constant Modulus Property: the CRIMNO criterion preserves the constant modulus property inherent in Godard.
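A sample-average version of the CRIMNO cost can be sketched as follows. The Python function below is only an illustration: its name, the use of time averages in place of ensemble averages, and the test sequence are assumptions made for this example.

```python
def crimno_cost(x_tilde, weights, r_p, p=2):
    """Sample estimate of the CRIMNO cost; weights = [w_0, w_1, ..., w_M]."""
    n = len(x_tilde)
    # Godard (dispersion) term, weighted by w_0.
    cost = weights[0] * sum((abs(v) ** p - r_p) ** 2 for v in x_tilde) / n
    # Squared magnitudes of the lag-l output correlations, l = 1..M.
    for lag in range(1, len(weights)):
        corr = sum(x_tilde[i] * x_tilde[i - lag].conjugate()
                   for i in range(lag, n)) / (n - lag)
        cost += weights[lag] * abs(corr) ** 2
    return cost


# Assumed test sequence: constant modulus, but each sample is j times the previous
# one, so successive outputs are completely dependent.
x = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j] * 2

godard_only = crimno_cost(x, [1.0, 0.0, 0.0], r_p=2.0)
with_memory = crimno_cost(x, [1.0, 0.5, 0.25], r_p=2.0)
```

For this sequence the memoryless Godard term is essentially zero because the modulus is constant, while the correlation terms contribute 0.5 · 4 + 0.25 · 4 = 3. The terms with memory therefore detect a dependence between outputs that the Godard term alone cannot see.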

4.3.2  CRIMNO  Blind  Equalization  Algorithm 

Define the equalizer coefficient vector $u(i) = [u_1(i), \cdots, u_N(i)]^T$, and the received signal vector $y(i) = [y(i), \cdots, y(i - N + 1)]^T$, where $N$ is the length of the equalizer. Then the equalizer outputs are

$$ \tilde{x}(i - l) = y^T(i - l) \cdot u(i), \quad l = 0, 1, \cdots, M \tag{4.64} $$

where superscript $T$ denotes transposition of a vector.

Differentiating the cost function with respect to the equalizer coefficient vector $u(i)$, we obtain [12]

$$
\begin{aligned}
\nabla_u M^{(2)} ={}& 4w_0 E[y^*(i)\tilde{x}(i)(|\tilde{x}(i)|^2 - R_2)] \\
&+ 2w_1[E(y^*(i-1)\tilde{x}(i))E(\tilde{x}^*(i)\tilde{x}(i-1)) + E(y^*(i)\tilde{x}(i-1))E(\tilde{x}(i)\tilde{x}^*(i-1))] \\
&+ \cdots \\
&+ 2w_M[E(y^*(i-M)\tilde{x}(i))E(\tilde{x}^*(i)\tilde{x}(i-M)) + E(y^*(i)\tilde{x}(i-M))E(\tilde{x}(i)\tilde{x}^*(i-M))].
\end{aligned}
\tag{4.65}
$$

By using the steepest descent method to search for the minimum point, we obtain

$$
\begin{aligned}
u(i+1) = u(i) - \alpha \cdot \{& 4w_0 E[y^*(i)\tilde{x}(i)(|\tilde{x}(i)|^2 - R_2)] \\
&+ 2w_1[E(y^*(i-1)\tilde{x}(i))E(\tilde{x}^*(i)\tilde{x}(i-1)) + E(y^*(i)\tilde{x}(i-1))E(\tilde{x}(i)\tilde{x}^*(i-1))] \\
&+ \cdots \\
&+ 2w_M[E(y^*(i-M)\tilde{x}(i))E(\tilde{x}^*(i)\tilde{x}(i-M)) + E(y^*(i)\tilde{x}(i-M))E(\tilde{x}(i)\tilde{x}^*(i-M))]\}
\end{aligned}
\tag{4.66}
$$

where

$$ u(i) = [u_1(i), \ldots, u_N(i)]^T. $$

In (4.66), the expectations are the ensemble averages taken with respect to the transmitted data $x(i)$ while the channel impulse response $f(i)$ and the equalizer coefficients $u(i)$ are treated as fixed.

If we use single point estimates for the ensemble averages, we obtain the stochastic gradient CRIMNO algorithm:

$$
\begin{aligned}
u(i+1) ={}& u(i) - \alpha[4w_0 y^*(i)\tilde{x}(i)(|\tilde{x}(i)|^2 - R_2) + 2w_1(y^*(i)\tilde{x}(i)|\tilde{x}(i-1)|^2 + y^*(i-1)\tilde{x}(i-1)|\tilde{x}(i)|^2) \\
&\quad + \cdots + 2w_M(y^*(i)\tilde{x}(i)|\tilde{x}(i-M)|^2 + y^*(i-M)\tilde{x}(i-M)|\tilde{x}(i)|^2)] \\
={}& u(i) - \alpha[y^*(i)\tilde{x}(i) \cdot (4w_0|\tilde{x}(i)|^2 + 2w_1|\tilde{x}(i-1)|^2 + \cdots + 2w_M|\tilde{x}(i-M)|^2 - 4w_0R_2) \\
&\quad + 2w_1 y^*(i-1)\tilde{x}(i-1)|\tilde{x}(i)|^2 + \cdots + 2w_M y^*(i-M)\tilde{x}(i-M)|\tilde{x}(i)|^2].
\end{aligned}
\tag{4.67}
$$

Note that at each iteration, all equalizer outputs $\tilde{x}(i - l)$, $l = 0, 1, \cdots, M$, are recalculated using the current (most recent) equalizer coefficient vector $u(i)$ via $\tilde{x}(i - l) = y^T(i - l)u(i)$. This requires a lot of computation. Suppose that, instead of using the current equalizer coefficient vector $u(i)$, we use the delayed equalizer coefficient vector $u(i - l)$ to calculate $\tilde{x}(i - l)$. For a small step-size, which is required for the stability of stochastic gradient-type algorithms, the difference between $u(i)$ and $u(i - l)$ is negligible. Then at each iteration we need to calculate only one equalizer output, $\tilde{x}(i)$, using the current equalizer coefficient vector $u(i)$.
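One iteration of the stochastic gradient CRIMNO recursion, in its factored form, can be sketched in Python as follows. The function names, the list-based data layout, and the numerical values are assumptions for illustration; all $M + 1$ outputs are recomputed with the current taps, i.e. the reduced-complexity variant with delayed taps is not used here.

```python
def equalizer_output(y_vec, u):
    """x~ = y^T u for one received vector y and tap vector u."""
    return sum(yk * uk for yk, uk in zip(y_vec, u))


def crimno_step(u, y_vecs, weights, r2, alpha):
    """One update of the taps u (p = 2).

    y_vecs[l] is the received vector y(i - l), l = 0..M; weights = [w_0, ..., w_M].
    """
    x = [equalizer_output(y, u) for y in y_vecs]       # x~(i - l), current taps
    m = len(weights) - 1
    # Scalar multiplying y*(i) x~(i) in the factored form of the update.
    scale0 = 4 * weights[0] * (abs(x[0]) ** 2 - r2) + sum(
        2 * weights[l] * abs(x[l]) ** 2 for l in range(1, m + 1))
    grad = [yk.conjugate() * x[0] * scale0 for yk in y_vecs[0]]
    # Terms 2 w_l y*(i - l) x~(i - l) |x~(i)|^2 for l = 1..M.
    for l in range(1, m + 1):
        c = 2 * weights[l] * x[l] * abs(x[0]) ** 2
        grad = [g + yk.conjugate() * c for g, yk in zip(grad, y_vecs[l])]
    return [uk - alpha * g for uk, g in zip(u, grad)]


u0 = [1 + 0j, 0j]                                      # assumed two-tap equalizer
ys = [[1 + 1j, 0.5 + 0j], [0.5 + 0j, -1 + 1j]]         # assumed y(i) and y(i - 1)
u_godard = crimno_step(u0, ys, [1.0, 0.0], r2=2.0, alpha=0.1)
u_crimno = crimno_step(u0, ys, [1.0, 0.5], r2=2.0, alpha=0.1)
```

With $w_1 = 0$ the step reduces to the Godard update, which leaves the taps unchanged in this example because the current output already has squared modulus $R_2$. With $w_1 = 0.5$ the correlation term still moves the taps.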

4.3.3  Adaptive  Weight  CRIMNO  Algorithm 

The shape of the cost function depends on the choice of weights $w_l$. So does the performance of the CRIMNO algorithm. Here, we describe an ad hoc way of adjusting the weights on-line in the blind equalization process.

The  basic  idea  is  to  estimate  the  values  of  all  terms  in  the  CRIMNO  cost  function  over  a 
block  of  data  and  then  set  the  weights  used  in  the  next  block  proportional  to  the  deviations  of 
the  corresponding  terms  from  their  ideal  values  at  perfect  equalization.  The  rationale  behind 
this  scheme  is  that  if  one  term  in  the  criterion  has  a  large  deviation  from  its  ideal  value,  then  in 
the  next  block  the  weight  associated  with  it  will  be  set  equal  to  a  large  value,  and  consequently, 
the  gradient-descent  method  will  bring  it  down  quickly. 

To elaborate on this idea, we rewrite the CRIMNO cost function as

$$ M^{(p)} = w_0 J_0 + w_1 J_1 + \cdots + w_M J_M \tag{4.68} $$

where

$$ J_0 = E(|\tilde{x}(i)|^p - R_p)^2, \qquad J_l = |E(\tilde{x}(i)\tilde{x}^*(i - l))|^2, \quad 1 \leq l \leq M. \tag{4.69} $$

Define the deviation of the $l$th term $D(J_l)$ by

$$ D(J_l) = |J_l - J_l^{(0)}|, \tag{4.70} $$

where $J_l^{(0)}$ is the value of $J_l$ at perfect equalization ($J_l^{(0)} = 0$, $l = 1, \cdots, M$). Then the weights are adjusted using the following formulae:

$$
w_0 = \begin{cases} \gamma_0 D(J_0), & \gamma_0 D(J_0) < \lambda \\ \lambda, & \gamma_0 D(J_0) \geq \lambda \end{cases}
\qquad
w_l = \begin{cases} \gamma D(J_l), & \gamma D(J_l) < \lambda \\ \lambda, & \gamma D(J_l) \geq \lambda \end{cases}
\tag{4.71}
$$

where $\gamma_0 > 0$ is the scaling constant for the first term, $\gamma > 0$ is the scaling constant for the other terms in the CRIMNO cost function, and $\lambda$ is a constraint on the maximum value of the weights to guarantee the stability of the algorithm.
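The weight adjustment rule amounts to a scaled deviation clipped at a maximum value. A minimal Python sketch is given below; the function name, the argument layout, and the numbers in the example are assumptions, and the block estimates of the $J_l$ are taken as given.

```python
def adaptive_weights(j_est, j0_ideal, gamma0, gamma, lam):
    """Weights for the next block from block estimates j_est = [J_0, ..., J_M].

    j0_ideal is the value of J_0 at perfect equalization; the ideal values of
    J_1..J_M are zero. Each weight is the scaled deviation, clipped at lam.
    """
    ideal = [j0_ideal] + [0.0] * (len(j_est) - 1)
    deviations = [abs(j - j_id) for j, j_id in zip(j_est, ideal)]
    scales = [gamma0] + [gamma] * (len(j_est) - 1)
    return [min(s * d, lam) for s, d in zip(scales, deviations)]


# Assumed block estimates and constants, for illustration only.
w = adaptive_weights([1.5, 0.2, 3.0], j0_ideal=1.0, gamma0=2.0, gamma=4.0, lam=5.0)
```

In this example the deviations are 0.5, 0.2 and 3.0, giving weights 1.0, 0.8 and 5.0. The last weight is clipped because 4.0 · 3.0 exceeds the maximum value.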

The  CRIMNO  algorithm  with  weights  adjusted  in  this  way  is  called  adaptive  weight  CRIMNO 
algorithm.  Some  in-depth  comments  are  provided  below: 


1. When the deviations of all terms vary proportionally, the adaptive weight scheme becomes an adaptive step-size algorithm. Moreover, the adaptation is done automatically. So when the algorithm converges, the weights decrease to zero. Hence, the adaptive weight CRIMNO algorithm acquires as a byproduct the decreasing step-size, which has been proven to be an optimal strategy for equalization [51].

2. For the adaptive weight CRIMNO algorithm, the shape of the cost function is changing. The local minima of the cost function are also changing. Thus, what is a local minimum of the cost function at one iteration may not be one at the next iteration. However, whatever the change of the weights, the global minimum does not change, and it always corresponds to perfect equalization.

3. The adaptive weight CRIMNO algorithm tends to move out of a local minimum of the cost function quickly, if the cost function has local minima and the algorithm gets trapped in one of them. This is based on the following arguments. In the adaptive weight CRIMNO algorithm, the equalizer coefficient increment, $\Delta u(i + 1) = u(i + 1) - u(i)$, is a random vector, the variance of which determines how fast the algorithm will move out of a local minimum. The variance of the equalizer coefficient increment depends on the step-size $\alpha$, the gradient, and the weights $w_l$ (proportional to $D(J_l)$). The step-size and gradient are the same as with the fixed weight CRIMNO algorithm; we thus concentrate on the third one: $w_l$, or equivalently $D(J_l)$. At a global minimum of the cost function, the $D(J_l)$ are all small; thus, the variance of the equalizer coefficient increment is small. Therefore, the algorithm will remain near the global minimum. However, that is not the case with a local minimum. In that case, $D(J_l)$ will be large; therefore, the variance of the equalizer coefficient increment will be large (relative to the case at the global minimum), and the algorithm will move out of that minimum quickly. Moreover, the larger the deviation $D(J_l)$, the more quickly the algorithm will move out of the local minimum.

4. Blocks of data are used to estimate $\{J_l\}$. The block length should be sufficiently long to make the variances of the estimates small, but not so long as to make the weight update fall behind.
4.3.4 CRIMNO Extensions


In this section, the CRIMNO ideas, i.e., memory nonlinearity, are extended to the following cases: (1) the case of correlated inputs; (2) the case when higher-order correlation terms [38] are utilized.

Colored CRIMNO

One of the key assumptions in the CRIMNO criterion is that the transmitted data are independent and identically distributed (i.i.d.). However, in practice, this may not be true for QAM signals. Usually, in order to overcome the phase ambiguity caused by the squaring loop for carrier recovery, differential encoding techniques are used, which correlate the input data when the source symbols are not equiprobable. Since the operations of differential encoding are known, the autocorrelations of the input data can be derived. In the case where the autocorrelations of the input data are known a priori, the CRIMNO criterion can be modified as follows:

$$ M_c^{(p)} = w_0 E(|\tilde{x}(i)|^p - R_p)^2 + w_1|E(\tilde{x}(i)\tilde{x}^*(i-1)) - \beta_1|^2 + \cdots + w_M|E(\tilde{x}(i)\tilde{x}^*(i-M)) - \beta_M|^2 \tag{4.72} $$

where $\beta_l = E(x(i)x^*(i - l))$ are the known autocorrelations of the transmitted data.

Higher-Order Correlation CRIMNO

Here, a criterion which exploits the higher-order correlations, such as the fourth-order statistics of the equalizer output, is given below:

$$ M_h^{(p)} = w_0 E(|\tilde{x}(i)|^p - R_p)^2 + \sum_l w_l|E(\tilde{x}(i)\tilde{x}^*(i-l))|^2 + \sum_{j,k,l \text{ all different}} v_{jkl}\,|E(\tilde{x}(i)\tilde{x}^*(i-j)\tilde{x}(i-k)\tilde{x}^*(i-l))|^2 \tag{4.73} $$

The performance of both the (4.72) and (4.73) criteria needs to be investigated.

4.3.5  Computer  Simulation 

Computer simulations have been conducted to compare the performance of the adaptive weight CRIMNO algorithm with that of the Godard (or CMA) algorithm. Fig. 4.6 shows the performance of the adaptive weight CRIMNO algorithm, compared with that of the Godard algorithm under different step-sizes, including the optimum one. We see that the performance of the adaptive weight CRIMNO algorithm is better than or approaches that of the Godard algorithm with optimum step-size. Fig. 4.7 shows the performance of the adaptive weight CRIMNO algorithm for different memory sizes ($M = 2, 4, 6$). Fig. 4.8 shows the corresponding eye-patterns at iteration 20000. We see that the larger the memory size $M$, the better the performance of the adaptive weight CRIMNO algorithm. Table 4.2 lists the computational complexity of the CRIMNO algorithm, the adaptive weight CRIMNO algorithm, and the Godard algorithm. We see that the performance improvement is achieved at the expense of only a small increase in computational complexity.

5  ALGORITHMS  WITH  NONLINEARITY  IN  THE  INPUT 
OF  THE  EQUALIZATION  FILTER 

The  Polyspectra  Based  Techniques 

Another class of blind equalization algorithms are those algorithms which are based on higher-order cumulants or polyspectra [36], such as the tricepstrum equalization algorithm (TEA) [24], the power cepstrum and tricoherence equalization algorithm (POTEA) [7], and the cross-tricepstrum equalization algorithm (CTEA) [8]. All these algorithms perform a nonlinear transformation on the input of the equalization filter. This nonlinear transformation, e.g. the generation of the higher-order cumulants or polyspectra of the received data, is a memory nonlinear transformation, because it employs both the present and the past values of the received data. The use of the higher-order statistics of the received data is necessary for blind equalization, since the correct phase information about the channel cannot be extracted from only the second-order statistics of the received data [14], [29], [34], [35], [37], [42].

5.1  Definitions  and  Properties:  Cumulants  and  Higher  Order  Spectra 

The  readers  are  assumed  to  be  somewhat  familiar  with  the  basic  material  of  higher-order  spectra. 
However,  some  important  properties  which  will  be  used  in  the  subsequent  sections  are  given. 

5.1.1  Definitions 

1. Definition of Cumulants:

Given a set of $n$ real random variables $\{x_1, x_2, \cdots, x_n\}$, their joint cumulant of order $n$ is defined as

$$ L(x_1, x_2, \cdots, x_n) = (-j)^n \left. \frac{\partial^n \ln \Phi(v_1, v_2, \cdots, v_n)}{\partial v_1 \partial v_2 \cdots \partial v_n} \right|_{v_1 = v_2 = \cdots = v_n = 0} \tag{5.1} $$

where

$$ \Phi(v_1, v_2, \cdots, v_n) = E\{\exp j(v_1 x_1 + \cdots + v_n x_n)\}. \tag{5.2} $$

Given a real stationary random sequence $\{x(i)\}$ with zero mean, $E\{x(i)\} = 0$, the $n$th-order cumulant of the random sequence depends only on the time differences and is defined as

$$ L_x(\tau_1, \tau_2, \cdots, \tau_{n-1}) = (-j)^n \left. \frac{\partial^n \ln \Phi_x(v_1, v_2, \cdots, v_n)}{\partial v_1 \partial v_2 \cdots \partial v_n} \right|_{v_1 = v_2 = \cdots = v_n = 0} \tag{5.3} $$

where $\tau_1, \tau_2, \cdots, \tau_{n-1}$ are integers and

$$ \Phi_x(v_1, v_2, \cdots, v_n) = E\{\exp j(v_1 x(i) + v_2 x(i + \tau_1) + \cdots + v_n x(i + \tau_{n-1}))\}. \tag{5.4} $$

Given a set of real jointly stationary random sequences $\{x_k(i)\}$, $k = 1, 2, \cdots, n$, with zero mean, $E\{x_k(i)\} = 0$, the $n$th-order cross-cumulant of the sequences depends only on the time differences and is defined as

$$ L_{x,1,2,\cdots,n}(\tau_1, \tau_2, \cdots, \tau_{n-1}) = (-j)^n \left. \frac{\partial^n \ln \Phi_{x,1,2,\cdots,n}(v_1, v_2, \cdots, v_n)}{\partial v_1 \partial v_2 \cdots \partial v_n} \right|_{v_1 = v_2 = \cdots = v_n = 0} \tag{5.5} $$

where $\tau_1, \tau_2, \cdots, \tau_{n-1}$ are integers and

$$ \Phi_{x,1,2,\cdots,n}(v_1, v_2, \cdots, v_n) = E\{\exp j(v_1 x_1(i) + v_2 x_2(i + \tau_1) + \cdots + v_n x_n(i + \tau_{n-1}))\}. \tag{5.6} $$

2. Definitions of Higher-Order Spectra.

Higher-order spectra are defined to be the Z-transforms of the corresponding cumulants [34], [38]. Specifically, an $n$th-order spectrum of a real stationary zero mean random sequence $\{x(i)\}$ is just the $(n - 1)$-dimensional Z-transform of the $n$th-order cumulant $L_x(\tau_1, \tau_2, \cdots, \tau_{n-1})$ of the random sequence. That is

$$ S_x(z_1, z_2, \cdots, z_{n-1}) \triangleq \sum_{\tau_1, \cdots, \tau_{n-1}} L_x(\tau_1, \tau_2, \cdots, \tau_{n-1}) \prod_{l=1}^{n-1} z_l^{-\tau_l}. \tag{5.7} $$

When $n = 2, 3, 4$ the corresponding spectrum is called power spectrum, bispectrum, and trispectrum, respectively.

An $n$th-order cross-spectrum of a set of real stationary zero mean random sequences $\{x_k(i)\}$, $k = 1, 2, \cdots, n$, is defined as the $(n - 1)$-dimensional Z-transform of the $n$th-order cross-cumulant $L_{x,1,2,\cdots,n}(\tau_1, \tau_2, \cdots, \tau_{n-1})$ of the random sequences, that is

$$ S_{x,1,2,\cdots,n}(z_1, z_2, \cdots, z_{n-1}) = \sum_{\tau_1, \cdots, \tau_{n-1}} L_{x,1,2,\cdots,n}(\tau_1, \tau_2, \cdots, \tau_{n-1}) \prod_{l=1}^{n-1} z_l^{-\tau_l}. \tag{5.8} $$

3. Definitions of coherence.

Coherence is defined as the higher-order spectrum normalized by the power spectrum. Specifically, an $n$th-order coherence of a real stationary zero mean random sequence $x(i)$ is defined as

$$ R_x(z_1, z_2, \cdots, z_{n-1}) \triangleq \frac{S_x(z_1, z_2, \cdots, z_{n-1})}{\left[S_x(z_1) S_x(z_2) \cdots S_x(z_{n-1})\, S_x\!\left(\prod_{l=1}^{n-1} z_l^{-1}\right)\right]^{1/2}} \tag{5.9} $$


An  alternative  definition  for  the  nth-order  coherence,  which  is  equivalent  to  the  above 
definitions,  is 


*2i  ‘  — 


Sx(2\  i  z2i  '  "  '  t  zn- 1 ) 


Sx{*l  iz1  )'"7Zn-l)J 


(5.10) 


4. Definitions of Cepstrum of Higher-Order Spectrum

The cepstrum is defined as the inverse Z-transform of the log function of the spectrum. Specifically, a cepstrum for the $n$th-order spectrum of a real stationary zero mean random sequence $\{x(i)\}$ is defined as

$$ c_x(\tau_1, \tau_2, \cdots, \tau_{n-1}) = Z^{-1}[\ln S_x(z_1, z_2, \cdots, z_{n-1})] \tag{5.11} $$

A cepstrum for the $n$th-order cross spectrum of a set of real stationary zero mean random sequences $\{x_k(i)\}$, $k = 1, 2, \cdots, n$, is defined as

$$ c_{x,1,2,\cdots,n}(\tau_1, \tau_2, \cdots, \tau_{n-1}) = Z^{-1}[\ln S_{x,1,2,\cdots,n}(z_1, z_2, \cdots, z_{n-1})] \tag{5.12} $$

When $n = 2, 3, 4$, the corresponding cepstrum is called power cepstrum, bicepstrum and tricepstrum, respectively.

5.1.2  Properties 

Some  important  properties  of  cumulants  are  shown  below. 

1. If $x_1, x_2, \ldots, x_n$ can be divided into two or more groups which are statistically independent, then the cumulant $L(x_1, x_2, \ldots, x_n)$ is zero.

Specifically, if $\{x(i)\}$ are independent, identically distributed random variables, the nth-order cumulant of the sequence $\{x(i)\}$ is

$$L_x(\tau_1, \tau_2, \ldots, \tau_{n-1}) = \gamma_x\, \delta(\tau_1, \tau_2, \ldots, \tau_{n-1}) \qquad (5.13)$$

2. Cumulants of higher order ($n \ge 3$) are zero for Gaussian processes.

3. If $\{x(i)\}$ and $\{y(i)\}$ are statistically independent random sequences and $z(i) = x(i) + y(i)$, then

$$L_z(\tau_1, \tau_2, \ldots, \tau_{n-1}) = L_x(\tau_1, \tau_2, \ldots, \tau_{n-1}) + L_y(\tau_1, \tau_2, \ldots, \tau_{n-1}). \qquad (5.14)$$
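Properties 2 and 3 can be illustrated numerically. The following is a minimal sketch, not part of the original text: it estimates the zero-lag fourth-order cumulant $E\{x^4\} - 3[E\{x^2\}]^2$ from samples of a zero-mean real sequence. The function name, the sample size, and the choice of binary PAM symbols are all assumptions made for illustration.

```python
import random

def fourth_cumulant_zero_lag(samples):
    """Sample estimate of E{x^4} - 3 (E{x^2})^2 for a zero-mean real sequence."""
    n = len(samples)
    m2 = sum(v * v for v in samples) / n
    m4 = sum(v ** 4 for v in samples) / n
    return m4 - 3.0 * m2 * m2

rng = random.Random(0)   # fixed seed so the sketch is repeatable
N = 20000
pam = [rng.choice((-1.0, 1.0)) for _ in range(N)]   # binary PAM, non-Gaussian
gauss = [rng.gauss(0.0, 1.0) for _ in range(N)]     # Gaussian noise
received = [s + w for s, w in zip(pam, gauss)]      # independent sum

gamma_pam = fourth_cumulant_zero_lag(pam)        # -2 for +/-1 symbols
gamma_gauss = fourth_cumulant_zero_lag(gauss)    # near 0 (property 2)
gamma_sum = fourth_cumulant_zero_lag(received)   # near gamma_pam (property 3)
```

The Gaussian term contributes almost nothing to the cumulant of the sum, which is the reason cumulant-based methods are expected to tolerate additive Gaussian noise.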

5.2  Tricepstrum  Equalization  Algorithm  (TEA) 

5.2.1  Problem  Formulations 

We assume that the received sequence, after being demodulated, low-pass filtered and synchronously sampled (at rate $1/T$), can be written as:

$$y(i) = z(i) + w(i) = \sum_{k=-L_1}^{L_2} f(k)\, x(i-k) + w(i) \qquad (5.15)$$

where the nonminimum phase equivalent channel impulse response $\{f(i)\}$ accounts for the transmitter filter, non-ideal channel (or multipath propagation), and receiver filter impulse response; the input data sequence $\{x(i)\}$ is generally complex, non-Gaussian, white, i.i.d., with $E\{x(i)\} = 0$, $E\{x(i)^3\} = 0$ and $E\{x(i)^4\} - 3[E\{x(i)^2\}]^2 = \gamma_x \neq 0$; for example $\{x(i)\}$ could be a multi-level symmetric PAM sequence or the complex baseband equivalent sequence of a symmetric QAM signal; the additive noise $\{w(i)\}$ is zero-mean, Gaussian, generally complex and statistically independent from $\{x(i)\}$; we also assume that the channel transfer function $F(z)$ (Z-transform of $\{f(i)\}$) admits the factorization [24]

$$F(z) = A \cdot I(z^{-1}) \cdot O(z) \qquad (5.16)$$

The factor

$$I(z^{-1}) = \frac{\prod_{k=1}^{L_1} (1 - a_k z^{-1})}{\prod_{k=1}^{L_3} (1 - c_k z^{-1})}, \quad |a_k| < 1,\ |c_k| < 1,$$

is a minimum phase polynomial, i.e., with zeros and poles inside the unit circle. The factor $O(z) = \prod_{k=1}^{L_2} (1 - b_k z)$, $|b_k| < 1$, is a maximum phase polynomial, i.e., with zeros outside the unit circle. The parameter $A$ is a constant gain factor. Finally, the sequence $\{y(i)\}$ is the input to the blind equalizer.

5.2.2  Relations  of  Tricepstrum  of  the  Linear  Filter  Output 

The input to the channel, $x(i)$, is a non-Gaussian i.i.d. random sequence, thus

$$S_x(z_1, z_2, z_3) = \gamma_x. \qquad (5.17)$$

The trispectrum of the output, $y(i)$, of the channel (linear filter) is

$$S_y(z_1, z_2, z_3) = \gamma_x F(z_1) F(z_2) F(z_3) F(z_1^{-1} z_2^{-1} z_3^{-1})$$

$$= \gamma_x \cdot A^4 \cdot I(z_1^{-1}) I(z_2^{-1}) I(z_3^{-1}) I(z_1 z_2 z_3) \cdot O(z_1) O(z_2) O(z_3) O(z_1^{-1} z_2^{-1} z_3^{-1}) \qquad (5.18)$$

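Taking the logarithm of $S_y(z_1, z_2, z_3)$ and then the inverse Z-transform, after some manipulation, we obtain [24]

$$c_y(m, n, l) = \begin{cases}
\log(\gamma_x A^4), & m = n = l = 0 \\
-\frac{1}{m} A^{(m)}, & m > 0,\ n = l = 0 \\
-\frac{1}{n} A^{(n)}, & n > 0,\ m = l = 0 \\
-\frac{1}{l} A^{(l)}, & l > 0,\ m = n = 0 \\
\frac{1}{m} B^{(-m)}, & m < 0,\ n = l = 0 \\
\frac{1}{n} B^{(-n)}, & n < 0,\ m = l = 0 \\
\frac{1}{l} B^{(-l)}, & l < 0,\ m = n = 0 \\
-\frac{1}{n} B^{(n)}, & m = n = l > 0 \\
\frac{1}{n} A^{(-n)}, & m = n = l < 0 \\
0, & \text{otherwise}
\end{cases} \qquad (5.19)$$

where $A^{(k)}$, $B^{(k)}$ are the minimum and maximum phase differential cepstrum parameters of the system, corresponding to $I(z^{-1})$ and $O(z)$, respectively. They are defined as follows:

$$A^{(k)} = \sum_{i=1}^{L_1} a_i^k - \sum_{i=1}^{L_3} c_i^k, \qquad B^{(k)} = \sum_{i=1}^{L_2} b_i^k \qquad (5.20)$$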

In addition, the following identity holds between the fourth-order cumulants $L_y(m, n, l)$ and the tricepstrum $c_y(m, n, l)$:

$$\sum_{J=1}^{\infty} \left\{ A^{(J)} [L_y(m-J, n, l) - L_y(m+J, n+J, l+J)] \right\} + \sum_{J=1}^{\infty} \left\{ B^{(J)} [L_y(m-J, n-J, l-J) - L_y(m+J, n, l)] \right\} = -m \cdot L_y(m, n, l) \qquad (5.21)$$

where we define

$$J \cdot c_y(J, 0, 0) = \begin{cases} -A^{(J)}, & J = 1, \ldots, \infty \\ B^{(-J)}, & J = -1, \ldots, -\infty. \end{cases}$$

$A^{(J)}$, $B^{(J)}$, $J = 1, 2, \ldots$ are the minimum and maximum phase cepstral coefficients, respectively, which are related to the zeros of $F(z)$. In practice, the infinite sums in (5.21) can be truncated at arbitrarily large but finite limits because $A^{(J)}$ and $B^{(J)}$ decay exponentially as $J$ increases.
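The identity (5.21) can be checked numerically on a small noise-free example. The sketch below is an illustration only and rests on stated assumptions: a channel $F(z) = (1 - a z^{-1})(1 - b z)$ with one minimum phase zero, one maximum phase zero, no poles and unit gain, so that $A^{(J)} = a^J$ and $B^{(J)} = b^J$; and the standard expression $L_y(m,n,l) = \gamma_x \sum_j f(j) f(j+m) f(j+n) f(j+l)$ for the output cumulants of a linear filter driven by an i.i.d. input. All function names are hypothetical.

```python
def channel_taps(a, b):
    """Taps of F(z) = (1 - a z^-1)(1 - b z), stored as {time index: value}."""
    return {-1: -b, 0: 1.0 + a * b, 1: -a}

def cumulant(f, gamma, m, n, l):
    """L_y(m,n,l) = gamma * sum_j f(j) f(j+m) f(j+n) f(j+l) (noise-free channel)."""
    return gamma * sum(
        fj * f.get(j + m, 0.0) * f.get(j + n, 0.0) * f.get(j + l, 0.0)
        for j, fj in f.items()
    )

def identity_residual(a, b, gamma, m, n, l, terms=80):
    """Left side minus right side of (5.21), sums truncated at `terms`."""
    f = channel_taps(a, b)
    L = lambda *lags: cumulant(f, gamma, *lags)
    lhs = 0.0
    for J in range(1, terms + 1):
        A_J, B_J = a ** J, b ** J   # one zero per factor, no poles
        lhs += A_J * (L(m - J, n, l) - L(m + J, n + J, l + J))
        lhs += B_J * (L(m - J, n - J, l - J) - L(m + J, n, l))
    return lhs - (-m * L(m, n, l))

# Largest residual over a cube of lags; it should vanish up to rounding.
worst = max(
    abs(identity_residual(0.5, 0.4, -2.0, m, n, l))
    for m in range(-3, 4) for n in range(-3, 4) for l in range(-3, 4)
)
```

Because the example channel has only three taps, the cumulants vanish for large lags and the truncated sums are exact here; for longer channels the truncation error is governed by the decay of $A^{(J)}$ and $B^{(J)}$.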

In practice, the fourth-order cumulants $L_y(\cdot)$ in (5.21) need to be replaced by their estimates $\hat{L}_y(\cdot)$ obtained from a finite length window of the received samples $\{y(i)\}$.

The TEA algorithm uses (5.21) to form an overdetermined system of equations, i.e., one with more equations than unknowns. TEA then solves this overdetermined system adaptively, using an LMS adaptation algorithm. At each iteration an estimate of the cepstral parameters $\{A^{(J)}\}$ and $\{B^{(J)}\}$ is computed. The coefficients of the equalizer are calculated from $\{A^{(J)}\}$ and $\{B^{(J)}\}$ by means of iterative formulas.

5.2.3  TEA  Algorithm 

Let:

$\{y(i)\}$: The received zero-mean synchronously sampled communication signal.

$N_1, N_2$: Lengths of minimum and maximum phase components of the equalizer.

$p, q$: Lengths of minimum and maximum phase cepstral parameters.

$\hat{M}_y^{(i)}(m, n, l)$: Estimated fourth-order moments of $\{y(i)\}$ at iteration (i).

$\hat{R}_y^{(i)}(j)$: Estimated second-order moments of $\{y(i)\}$ at iteration (i).

$\hat{L}_y^{(i)}(m, n, l)$: Estimated fourth-order cumulants of $\{y(i)\}$ at iteration (i).

Symmetric PAM or QAM Signaling:

In general, for 1-D (e.g. PAM) or 2-D (e.g. QAM) signaling with symmetric constellations:

$$\hat{L}_y^{(i)}(m, n, l) = \hat{M}_y^{(i)}(m, n, l) - \hat{R}_y^{(i)}(m) \hat{R}_y^{(i)}(n - l) - \hat{R}_y^{(i)}(n) \hat{R}_y^{(i)}(l - m) - \hat{R}_y^{(i)}(l) \hat{R}_y^{(i)}(m - n) \qquad (5.22)$$

For symmetric square ($L \times L$) QAM constellations:

$$\hat{L}_y^{(i)}(m, n, l) = \hat{M}_y^{(i)}(m, n, l) \qquad (5.23)$$

$\hat{A}_{(i)}^{(J)}, \hat{B}_{(i)}^{(J)}$: The minimum and maximum phase differential cepstrum parameters at iteration (i), respectively.

$L_1, L_2$: The orders of the minimum phase and maximum phase components of the FIR channel, respectively. Note that $\{a_i\}$, $|a_i| < 1$, and $\{b_i\}$, $|b_i| < 1$, are the zeros of the minimum and maximum phase components of the FIR channel, respectively.

$\{u(i)\}$: The coefficients of the equalizer at iteration (i).

{£•(*)}:  The  coefficients  of  the  equalizer  at  iteration  ( i ). 

At iteration (i): $i = 1, 2, \ldots$

Step 1 Estimate adaptively the $\hat{L}_y^{(i)}(m, n, l)$, $-M \le m, n, l \le M$, from a finite length window of $\{y(k)\}$ as described below. $M$ should be sufficiently large so that $L_y(m, n, l) \approx 0$ for $|m|, |n|, |l| > M$. Assuming that at iteration (0) we have received the time samples $\{y(1), \ldots, y(I_{lag})\}$, we proceed as follows:

Stationary Case with Growing Rectangular Window

$$\hat{M}_y^{(i)}(m, n, l) = (1 - \eta(i)) \cdot \hat{M}_y^{(i-1)}(m, n, l) + \eta(i) \cdot y(S_4^i)\, y(S_4^i + m)\, y(S_4^i + n)\, y(S_4^i + l) \qquad (5.24)$$

$$\hat{R}_y^{(i)}(j) = (1 - \eta(i)) \cdot \hat{R}_y^{(i-1)}(j) + \eta(i) \cdot y(S_2^i)\, y(S_2^i + j) \qquad (5.25)$$

where $\eta(i) = \frac{1}{i + I_{lag}}$, $S_4^i = \min(i + I_{lag},\ i + I_{lag} - m,\ i + I_{lag} - n,\ i + I_{lag} - l)$, and $S_2^i = \min(i + I_{lag},\ i + I_{lag} - j)$. Finally substitute (5.24) and (5.25) into (5.22) or (5.23).

Nonstationary Case

First Way:

for $i \le I_1$, use (5.24), (5.25) with $\eta(i) = \frac{1}{i + I_{lag}}$;

for $i > I_1$, use (5.24), (5.25) with $\eta(i) = \eta$ (fixed). (5.26)

$\eta$ should have a small value ($0 < \eta < 1$), for example $\eta = 0.01$.
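The recursive updates (5.24) and (5.25) can be sketched as follows for a second-order moment. This is a simplified illustration, not the authors' implementation: it assumes no warm-up block ($I_{lag} = 0$), a non-negative lag, and real data, and the function names are hypothetical. With the step $\eta = 1/\text{count}$ the recursion reproduces the batch average (growing window); with a fixed small $\eta$ it forgets old data exponentially, as in the first nonstationary variant.

```python
def update_moment(previous, product, eta):
    """One step of (5.24)/(5.25): blend the old estimate with the newest lag product."""
    return (1.0 - eta) * previous + eta * product

def running_second_moment(y, lag, fixed_eta=None):
    """Track an estimate of E{y(k) y(k+lag)} sample by sample (lag >= 0)."""
    estimate = 0.0
    for count, k in enumerate(range(len(y) - lag), start=1):
        eta = fixed_eta if fixed_eta is not None else 1.0 / count
        estimate = update_moment(estimate, y[k] * y[k + lag], eta)
    return estimate

y = [1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, -1.0]
growing = running_second_moment(y, lag=1)                    # growing window
batch = sum(y[k] * y[k + 1] for k in range(len(y) - 1)) / (len(y) - 1)
tracking = running_second_moment(y, lag=1, fixed_eta=0.01)   # fixed step size
```

The same update applies unchanged to the fourth-order moments, with the product of two samples replaced by the product of four.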

Second Way: (for symmetric QAM signaling)

Since in this case the second-order moment $R_y(j) = 0$, we can use $\hat{M}_y(m, n, l)$ with a forgetting factor $w$, $0 < w < 1$, as follows ($S_4^i$ is as before):

$$(i + I_{lag}) \cdot \hat{M}_y^{(i)}(m, n, l) = w\,(i - 1 + I_{lag}) \cdot \hat{M}_y^{(i-1)}(m, n, l) + y(S_4^i)\, y(S_4^i + m)\, y(S_4^i + n)\, y(S_4^i + l) \qquad (5.27)$$

and substitute $(i + I_{lag}) \cdot \hat{M}_y^{(i)}(m, n, l)$ for $\hat{L}_y^{(i)}(m, n, l)$ everywhere.

Third  Way: 

Formulas (5.24) and (5.25) could be used in nonstationary environments by reinitializing the algorithm after a certain number of iterations or when a channel change is detected.

Remarks: 


By  using  the  symmetry  properties  of  fourth-order  cumulants  only  cumulants  need 

to  be  calculated. 


The assumption that $I_{lag}$ data have been received at iteration (0) avoids ill conditioning of the matrices of the system given in Step 3. It causes a delay of $I_{lag}$ at the input of the equalizer.


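Step 2

Select $p$, $q$ arbitrarily large so that $A^{(I)} \approx 0$ and $B^{(J)} \approx 0$ for $I > p$ and $J > q$. For example, with $c = 10^{-4}$ (a very small constant):

$$A^{(I)} \approx 0 \ \text{ for } \ I > p = \operatorname{int}\left[\frac{\ln c}{\ln \alpha}\right], \qquad B^{(J)} \approx 0 \ \text{ for } \ J > q = \operatorname{int}\left[\frac{\ln c}{\ln \beta}\right] \qquad (5.28)$$

where $\operatorname{int}[\cdot]$ denotes integer part and $\max|a_i| < \alpha < 1$, $\max|b_i| < \beta < 1$.
Define: $w = \max(p, q)$, $z \le w/2$, $s \le z$.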
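As a worked illustration of Step 2, the sketch below computes truncation lengths under the assumption that the cepstral parameters are bounded by a geometric sequence, so that the cut-off is the integer part of $\ln c / \ln(\text{bound})$. The bounds 0.5 and 0.9 and the function name are hypothetical values chosen for the example, and $z$ and $s$ are simply set to the largest values the rule allows.

```python
import math

def truncation_length(c, bound):
    """Integer part of ln c / ln bound: beyond this order, bound**k is below c."""
    if not (0.0 < c < 1.0 and 0.0 < bound < 1.0):
        raise ValueError("c and bound must lie strictly between 0 and 1")
    return int(math.log(c) / math.log(bound))

p = truncation_length(1e-4, 0.5)   # alpha = 0.5, hypothetical bound on |a_i|
q = truncation_length(1e-4, 0.9)   # beta = 0.9, hypothetical bound on |b_i|
w = max(p, q)
z = w // 2                         # largest z with z <= w/2
s = z                              # largest s with s <= z
```

Zeros close to the unit circle (a bound near 1) drive the truncation length up quickly, which in turn increases the number of unknowns in the system solved in Step 3.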

Step 3

Using the relation:

$$\sum_{I=1}^{p} \left\{ \hat{A}_{(i)}^{(I)} [\hat{L}_y^{(i)}(m-I, n, l) - \hat{L}_y^{(i)}(m+I, n+I, l+I)] \right\} + \sum_{J=1}^{q} \left\{ \hat{B}_{(i)}^{(J)} [\hat{L}_y^{(i)}(m-J, n-J, l-J) - \hat{L}_y^{(i)}(m+J, n, l)] \right\} = -m \cdot \hat{L}_y^{(i)}(m, n, l) \qquad (5.29)$$

with $m = -w, \ldots, -1, 1, \ldots, w$, $n = -z, \ldots, 0, \ldots, z$ and $l = -s, \ldots, 0, \ldots, s$, form the overdetermined system of equations:

$$P(i) \cdot \hat{a}(i) = p(i), \quad i = 0, 1, 2, \ldots \qquad (5.30)$$

where $P(i)$ is an $[N_p \times (p + q)]$ matrix (where $N_p = 2w \times (2z + 1) \times (2s + 1)$) with entries of the form $\{\hat{L}_y^{(i)}(m, n, l) - \hat{L}_y^{(i)}(\sigma, \tau, \lambda)\}$; $\hat{a}(i) = [\hat{A}_{(i)}^{(1)}, \ldots, \hat{A}_{(i)}^{(p)}, \hat{B}_{(i)}^{(1)}, \ldots, \hat{B}_{(i)}^{(q)}]^T$ ($T$ denotes transpose) is the $(p + q) \times 1$ vector of unknown cepstral parameters; $p(i)$ is the $N_p \times 1$ vector with entries of the form $\{-m \cdot \hat{L}_y^{(i)}(m, n, l)\}$.

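Step 4

Assume that initially $\hat{a}(0) = [0, \ldots, 0]^T$. Update $\hat{a}(i) = [\hat{A}_{(i)}^{(1)}, \ldots, \hat{A}_{(i)}^{(p)}, \hat{B}_{(i)}^{(1)}, \ldots, \hat{B}_{(i)}^{(q)}]^T$ as follows:

$$\hat{a}(i+1) = \hat{a}(i) + \mu(i) \cdot P^H(i) \cdot e(i), \qquad (5.31)$$

$$e(i) = p(i) - P(i) \cdot \hat{a}(i), \qquad 0 < \mu(i) < 2 / \operatorname{tr}\{P^H(i) \cdot P(i)\} \qquad (5.32)$$

Step 5

Calculate the equalizer normalized coefficients. Initialize $\hat{i}_{inv}(i, 0) = \hat{o}_{inv}(i, 0) = 1$ and then estimate:

$$\hat{i}_{inv}(i, k) = \frac{1}{k} \sum_{n=2}^{k+1} \hat{A}_{(i)}^{(n-1)} \cdot \hat{i}_{inv}(i, k - n + 1), \quad k = 1, \ldots, N_1 \qquad (5.33)$$

$$\hat{o}_{inv}(i, k) = \frac{1}{k} \sum_{n=k+1}^{0} [-\hat{B}_{(i)}^{(1-n)}] \cdot \hat{o}_{inv}(i, k - n + 1), \quad k = -1, \ldots, -N_2 \qquad (5.34)$$

where (i) is the iteration index taking values $i = 1, 2, 3, \ldots$ Then,

$$\hat{u}_{norm}(i, k) = \hat{i}_{inv}(i, k) * \hat{o}_{inv}(i, k), \quad k = -N_2, \ldots, 0, \ldots, N_1 \qquad (5.35)$$

where $\{*\}$ denotes linear convolution.

Step 6

Estimate the gain factor $A(i)$ as follows. In Step 1 we have already calculated:

$$\hat{L}_y^{(i)}(0, 0, 0) \approx \gamma_x \cdot \sum_k (f(k))^4$$

$$\hat{R}_y^{(i)}(0) \approx Q_x \cdot \sum_k (f(k))^2 \qquad (5.36)$$

where $Q_x = E\{(x(k))^2\}$ and $\gamma_x = E\{(x(k))^4\} - 3 \cdot Q_x^2$ are known. Also:

$$\hat{i}(i, k) = -\frac{1}{k} \sum_{n=2}^{k+1} \hat{A}_{(i)}^{(n-1)} \cdot \hat{i}(i, k - n + 1), \quad k = 1, \ldots, p$$

$$\hat{o}(i, k) = \frac{1}{k} \sum_{n=k+1}^{0} \hat{B}_{(i)}^{(1-n)} \cdot \hat{o}(i, k - n + 1), \quad k = -1, \ldots, -q \qquad (5.37)$$

and $\hat{f}(i, k) = \hat{i}(i, k) * \hat{o}(i, k)$, where $\{*\}$ denotes convolution, $Q_f(i) = \sum_k (\hat{f}(i, k))^2$, and $\gamma_f(i) = \sum_k (\hat{f}(i, k))^4$. Then (the sign of $1/\hat{A}(i)$ cannot be identified):

For L-PAM Signaling:

$$\frac{1}{\hat{A}(i)} = \left( \frac{Q_x\, Q_f(i)}{\hat{R}_y^{(i)}(0)} \right)^{1/2} \qquad (5.38)$$

For $L^2$-QAM Signaling:

$$\frac{1}{\hat{A}(i)} \approx \left( \frac{\gamma_x\, \gamma_f(i)}{\hat{L}_y^{(i)}(0, 0, 0)} \right)^{1/4} = |\gamma_x|^{1/4} \left( \frac{-\gamma_f(i)}{\hat{L}_y^{(i)}(0, 0, 0)} \right)^{1/4} \qquad (5.39)$$

since $\gamma_x < 0$ for equi-probable $L^2$-QAM signaling.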
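The recursions (5.33) and (5.37) that convert cepstral parameters into impulse responses can be sketched as below. This is an illustrative example under assumptions: a real FIR minimum phase factor with two hypothetical zeros (0.5 and -0.3) and no poles, so that $A^{(k)}$ is the sum of the $k$-th powers of the zeros. The function names are not from the original. The maximum phase recursion has the same form with the time index mirrored, so only the causal case is shown.

```python
def cepstral_parameters(zeros, count):
    """A^(k) = sum_i zeros_i**k for k = 1..count (FIR factor, no poles)."""
    return [sum(z ** k for z in zeros) for k in range(1, count + 1)]

def taps_from_cepstrum(params, length, inverse=False):
    """Causal taps h(0..length) with h(0) = 1.

    inverse=False follows the channel recursion (5.37);
    inverse=True follows the equalizer recursion (5.33).
    """
    sign = 1.0 if inverse else -1.0
    h = [1.0]
    for k in range(1, length + 1):
        acc = 0.0
        for n in range(2, k + 2):        # n = 2 .. k+1 as in the recursions
            if n - 1 <= len(params):     # terms beyond the truncation are dropped
                acc += params[n - 2] * h[k - n + 1]
        h.append(sign * acc / k)
    return h

def convolve(u, v):
    """Plain linear convolution of two tap lists."""
    out = [0.0] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            out[i + j] += ui * vj
    return out

A = cepstral_parameters([0.5, -0.3], 10)
channel = taps_from_cepstrum(A, 10)                 # taps of (1 - 0.5 z^-1)(1 + 0.3 z^-1)
inverse = taps_from_cepstrum(A, 10, inverse=True)   # first taps of the inverse filter
product = convolve(channel, inverse)[:11]           # should approximate a unit impulse
```

The only difference between the channel and its inverse is the sign applied to the cepstral parameters, which is why the equalizer can be formed without first estimating the channel taps.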

Step  7 

Let $y(i) = [y(i + N_2), \ldots, y(i - N_1)]^T$ and $[\hat{u}_{norm}(i)] = [\hat{u}_{norm}(i, -N_2), \ldots, \hat{u}_{norm}(i, N_1)]^T$. Finally, the output of the TEA equalizer is:

$$\tilde{x}(i) = \frac{1}{\hat{A}(i)} \cdot [\hat{u}_{norm}(i)]^T \cdot y(i) \qquad (5.40)$$

While most of the Bussgang blind equalization algorithms, which are based on non-MSE cost function minimization, have not been shown to be globally convergent, and cases of their misconvergence have been encountered, the TEA algorithm, designed as described above, is a more reliable alternative, as it guarantees convergence.

Remarks: 


1.  Since  Gaussian  noise  is  suppressed  in  the  fourth-order  cumulant  domain,  the  identification 
of  the  channel  response  does  not  take  into  account  the  observation  noise.  Consequently, 
the  proposed  equalizers  work  under  the  zero-forcing  (ZF)  constraint.  For  the  same  reason, 
we expect that the identification of the channel will be satisfactory even in low signal-to-noise ratio (SNR) conditions.

2.  The  ability  of  the  tricepstrum  method  to  identify  separately  the  maximum  and  minimum 
phase  components  of  the  channel  makes  possible  the  design  and  implementation  of  different 
equalization  structures. 

3. In the recursive formulas (5.37) we used the following properties that relate time impulse responses with cepstrum coefficients: (i) a channel and its inverse have cepstrum coefficients of opposite sign, (ii) the cepstrum coefficients of the convolution of two minimum phase or two maximum phase sequences are equal to the sum of the corresponding cepstrum coefficients of the individual sequences, and (iii) two finite impulse response (FIR) sequences with conjugate roots also have conjugate cepstrum coefficients. These become unique features of the TEA equalizer when it is compared with other equalization schemes.

4. The described algorithm is based only on the statistics of the received sequence $\{y(i)\}$ and does not take into account the decisions $\{\hat{x}(i)\}$ at the output of the equalizer. Consequently, wrong decisions (and thus error propagation effects) do not affect the convergence of the proposed equalization schemes.

5.  Instead  of  using  the  LMS  algorithm  to  solve  adaptively  the  system  of  equations  (5.30), 
one  may  employ  a  Recursive  Least-Squares  (RLS)  algorithm  [25]  which  will  have  a  faster 
convergence  at  the  expense  of  even  more  computations. 


5.2.4  Power  Cepstrum  and  Tricoherence  Equalization  Algorithm  (POTEA)  [7] 

5.2.5  Relations  of  Power  Cepstrum  and  Tricoherence  of  the  Linear  Filter  Output 

The problem is as formulated in Section 5.2.1: the channel output $y(i)$ is the convolution of the non-Gaussian i.i.d. random sequence $x(i)$ with the channel impulse response $f(i)$, plus some noise. The cepstrum of the power spectrum of the channel output $y(i)$ can be shown, after some algebra, to be equal to [7]

$$c_y(m) = \begin{cases}
\ln|A^2|, & m = 0 \\
-\frac{1}{m} [A^{(m)} + B^{*(m)}], & m > 0 \\
\frac{1}{m} [A^{*(-m)} + B^{(-m)}], & m < 0
\end{cases} \qquad (5.41)$$

where $A^{(k)}$, $B^{(k)}$ are the minimum and maximum phase cepstral coefficients of $F(z)$. These are:

$$A^{(k)} = \sum_{i=1}^{L_1} a_i^k - \sum_{i=1}^{L_3} c_i^k, \qquad B^{(k)} = \sum_{i=1}^{L_2} b_i^k \qquad (5.42)$$

where $\{a_i\}$ and $\{b_i\}$ are the zeros of $F(z)$ inside and outside of the unit circle respectively.

Remarks:

1. $A^{(k)}$, $B^{(k)}$ decay exponentially and thus their length can be truncated in practice at $k = p$, so that $A^{(p)}$, $B^{(p)}$ are arbitrarily small.

2. If the channel $F(z)$ has cepstral coefficients $A^{(k)}$, $B^{(k)}$, its inverse filter, $U(z)$, has cepstral coefficients $-A^{(k)}$, $-B^{(k)}$. It is also shown in [7] that if we define $S^{(k)} = A^{(k)} + B^{*(k)}$ and $r_y(k) = E\{y(i + k)\, y^*(i)\}$, then the following relation holds:

$$\sum_{k=1}^{p} S^{(k)} [-r_y(m - k)] + \sum_{k=1}^{p} S^{*(k)} [r_y(m + k)] = m\, r_y(m), \quad m = 1, \ldots, 2p \qquad (5.43)$$

where $p$ is some integer, the choice of which is discussed in [24]. Now let us consider the cepstrum of the tricoherence.


$$R_y(z_1, z_2, z_3) = \left[ \frac{S_y(z_1, z_2, z_3)}{S_y^*(z_1^{-1}, z_2^{-1}, z_3^{-1})} \right]^{1/2} \qquad (5.44)$$


It  has  been  shown  that  the  trispectrum  of  the  received  data  satisfies: 


Sy(z,. 


2.  ~3 


-  lrF'{z\ 


-1 


)F(z2)F'(z;')F(z;'zilz; 


') 


(5.45) 
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Therefore, 


After  some  algebra,  we  obtain 


('.Ad) 


Ry(m,  n,l)  =  - 


/n|.4,| 

m  =  0,  n  =  0,  /  =  0 

m  >  0,  n  =  0,  l  =  0 

_ — _  _g-(— m) J 

3 

A 

O 

3 

II 

O 

1! 

0 

-  5*<n>] 

m  =  0,n>0,/  =  0 

1 

* 

03 

1 

IT 

1 

§ 

Her 

1 

m  =  0,  n  >  0,  /  =  0 

T[A*(m>  - 

3 

II 

3 

II 

«--* 

V 

0 

— [A^-m)  - 

m  =  n  =  l  <  0 

_|[.4-f0  _  5<0] 

m  =  0,  n  =  0,  l  >  0 

m  =  0,  n  =  0,  /  <  0 

0 

otherwise 

(5.47)

Taking the logarithm of both sides of (5.44), we obtain

$$\ln R_y(z_1, z_2, z_3) = \frac{1}{2} \left[ \ln S_y(z_1, z_2, z_3) - \ln S_y^*(z_1^{-1}, z_2^{-1}, z_3^{-1}) \right] \qquad (5.48)$$

Differentiating with respect to $z_1$ and performing the inverse Z-transform, we obtain

$$2 L_y(m, n, l) * L_y^*(-m, -n, -l) * [-m R_y(m, n, l)] = L_y^*(-m, -n, -l) * [-m L_y(m, n, l)] + L_y(m, n, l) * [m L_y^*(-m, -n, -l)] \qquad (5.49)$$

By defining the following functions:

$$\theta_1(m, n, l) = L_y^*(-m, -n, -l) * L_y(m, n, l)$$

$$\theta_2(m, n, l) = L_y^*(-m, -n, -l) * [m L_y(m, n, l)] \qquad (5.50)$$

and combining (5.49) and (5.50), we obtain:

$$2\, \theta_1(m, n, l) * [m R_y(m, n, l)] = \theta_2(m, n, l) + \theta_2^*(-m, -n, -l) \qquad (5.51)$$

Defining $D^{(k)}$ and combining with (5.47), we obtain:

$$\sum_{k=1}^{p} D^{*(k)} [\theta_1(m - k, n - k, l - k) - \theta_1(m - k, n, l)] + \sum_{k=1}^{p} D^{(k)} [\theta_1(m + k, n + k, l + k) - \theta_1(m + k, n, l)] = \theta_2(m, n, l) + \theta_2^*(-m, -n, -l) \qquad (5.52)$$

A rule of thumb is to define $w = p$, $z \le w/2$, $h \le z$ and then take $m = -w, \ldots, -1, 1, \ldots, w$, $n = -z, \ldots, z$, $l = -h, \ldots, h$ to form a linear overdetermined system of equations.

5.3  The  POTEA  Algorithm 

In  this  section  the  POTEA  algorithm  is  given  in  detail. 

Let

$N_1, N_2$: Lengths of minimum and maximum phase components of the equalizer.

$p$: Length of minimum and maximum phase cepstral parameters.

At iteration $i = 1, 2, \ldots$

Step 1 Estimate adaptively the $\hat{L}_y^{(i)}(m, n, l)$ for $-M_1 \le m, n, l \le M_1$, and $\hat{r}_y^{(i)}(m)$ for $-M_2 \le m \le M_2$ from a finite length window of $\{y(n)\}$, and then generate the functions $\theta_1(m, n, l)$ and $\theta_2(m, n, l)$ of (5.50).

Step 2 Choose $p$ arbitrarily such that $A^{(p+1)} \approx 0$, $B^{(p+1)} \approx 0$ and define $w = p$, $z \le w/2$, $h \le z$.

Step 3 Form the equations

$$\sum_{k=1}^{p} S^{(k)} [-r_y(m - k)] + \sum_{k=1}^{p} S^{*(k)} [r_y(m + k)] = m\, r_y(m), \quad m = 1, \ldots, 2p \qquad (5.53)$$

where $S^{(k)} = A^{(k)} + B^{*(k)}$, $k = 1, \ldots, p$,

$$\sum_{k=1}^{p} D^{*(k)} [\theta_1(m - k, n - k, l - k) - \theta_1(m - k, n, l)] + \sum_{k=1}^{p} D^{(k)} [\theta_1(m + k, n + k, l + k) - \theta_1(m + k, n, l)] = \theta_2(m, n, l) + \theta_2^*(-m, -n, -l) \qquad (5.54)$$

and the following systems of equations

$$P a = p \qquad (5.55)$$

$$Q b = q \qquad (5.56)$$

where the matrices $P$, $a$, $p$, $Q$, $b$ and $q$ are defined above.

Step 4 Solve adaptively the above systems employing LMS-type adaptation as follows:

$$a(i + 1) = a(i) + \mu(i) P^H(i) e(i) \qquad (5.57)$$

$$b(i + 1) = b(i) + \mu'(i) Q^H(i) e'(i) \qquad (5.58)$$

where

$$e(i) = p(i) - P(i) a(i), \qquad e'(i) = q(i) - Q(i) b(i),$$

$$0 < \mu(i) < \frac{2}{\operatorname{tr}(P^H P)}, \qquad 0 < \mu'(i) < \frac{2}{\operatorname{tr}(Q^H Q)}.$$

The algorithm at instant $i$ minimizes the mean square error:

$$J(i) = E\{e^H(i) e(i)\}$$

$$J'(i) = E\{e'^H(i) e'(i)\}$$
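The LMS-type update of Step 4 can be sketched on a small real-valued system. This is a simplified illustration under assumptions: the matrix is held fixed across iterations (in the algorithm above it is rebuilt from new cumulant estimates at every iteration), the data are real so the Hermitian transpose reduces to the ordinary transpose, and the step size is set to $1/\operatorname{tr}(P^T P)$, which lies inside the stated bound. The function name and the example system are hypothetical.

```python
def lms_solve(P, p, iterations=200):
    """Iterate a <- a + mu * P^T (p - P a) with mu = 1 / tr(P^T P), real-valued case."""
    rows, cols = len(P), len(P[0])
    trace = sum(P[r][c] ** 2 for r in range(rows) for c in range(cols))
    mu = 1.0 / trace              # inside the bound 0 < mu < 2 / tr(P^H P)
    a = [0.0] * cols              # a(0) = 0
    for _ in range(iterations):
        # Residual e = p - P a
        e = [p[r] - sum(P[r][c] * a[c] for c in range(cols)) for r in range(rows)]
        # Gradient step a += mu * P^T e
        for c in range(cols):
            a[c] += mu * sum(P[r][c] * e[r] for r in range(rows))
    return a

P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 equations, 2 unknowns
p_vec = [2.0, -1.0, 1.0]                   # consistent with a = [2, -1]
a_hat = lms_solve(P, p_vec)
```

Using the trace as the step-size normalizer avoids computing eigenvalues: the trace of $P^H P$ is never smaller than its largest eigenvalue, so the bound is sufficient for the iteration to be stable.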


Step 5 Calculate $A^{(k)}$ and $B^{(k)}$ as follows:

(5.59)

Step 6 Calculate

$$\hat{i}_{eq}(i, k) = \frac{1}{k} \sum_{n=2}^{k+1} A^{(n-1)}\, \hat{i}_{eq}(i, k - n + 1), \quad k = 1, \ldots, N_1 \qquad (5.60)$$

$$\hat{o}_{eq}(i, k) = \frac{1}{k} \sum_{n=k+1}^{0} [-B^{(1-n)}]\, \hat{o}_{eq}(i, k - n + 1), \quad k = -1, \ldots, -N_2 \qquad (5.61)$$

with initialization: $\hat{i}_{eq}(i, 0) = \hat{o}_{eq}(i, 0) = 1$. The normalized ($A = 1$) estimate $\hat{u}_{norm}(i, k)$ at iteration (i) is given by:

$$\hat{u}_{norm}(i, k) = \hat{i}_{eq}(i, k) * \hat{o}_{eq}(i, k) \qquad (5.62)$$

Step 7 Estimate the gain factor $A(i)$.

Step 8 The reconstructed transmitted sequence at iteration (i) is:

$$\tilde{x}(i) = \frac{1}{\hat{A}(i)} \sum_{k=-N_2}^{N_1} \hat{u}_{norm}(i, k)\, y(i - k) \qquad (5.63)$$
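The reconstruction in (5.63) is a two-sided FIR filter applied to the received samples and scaled by the inverse gain. A minimal sketch follows, assuming a finite data record with samples outside the record treated as zero, and equalizer taps stored in a dictionary keyed by the tap index $k$; both conventions and the function name are choices made here for illustration.

```python
def equalize(y, u_norm, gain, n1, n2):
    """x(i) = (1/gain) * sum_{k=-n2}^{n1} u_norm[k] * y(i-k).

    u_norm maps tap index k (negative = anticausal) to the tap value.
    """
    out = []
    for i in range(len(y)):
        acc = 0.0
        for k in range(-n2, n1 + 1):
            if 0 <= i - k < len(y):      # samples outside the record count as zero
                acc += u_norm.get(k, 0.0) * y[i - k]
        out.append(acc / gain)
    return out

scaled = equalize([2.0, 4.0, 6.0], {0: 1.0}, 2.0, 1, 1)    # pure gain correction
advanced = equalize([1.0, 2.0, 3.0], {-1: 1.0}, 1.0, 0, 1) # single anticausal tap
```

The anticausal taps ($k < 0$) come from the maximum phase part of the equalizer and require future samples, which is why a practical implementation introduces a delay of at least $N_2$ samples.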


Computational  Complexity 

In  this  section  the  computational  complexity  of  POTEA  is  presented  and  compared  with  the 
computational  complexity  of  TEA. 

PAM 

POTEA:  +  3(2 M  +  1)  +  2 p(Np  +  p  +1)  +  +  (4M)3log2  4 M 

TEA:  +  3(2A/ +  l)  +  (p+q)(2iVp+  1)+ 

QAM 

POTEA:  4[^2.'yii3  +  2(2 M  +  1)  +  2p(2Np  +  4p  +  2)  +  +  (4M)3  log2  4 M] 

TEA:  +  (P  +  q)(2Np  +  1)  4-  'V-^-f+3-] 
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5.4  Cross-Tricepstrum  Equalization  Algorithm  (CTEA)  [8] 

5.4.1  Problem  Formulations 

Assume we have $n$ measurements at each time index $k$, $y_i(k)$, $i = 1, 2, \ldots, n$, where

$$y_i(k) = f_i(k) * x(k) + n_i(k) \qquad (5.64)$$

(shown in Figure 5.1 for $n = 4$) and

1. $f_i(k)$ is the impulse response of a discrete time linear time invariant system,

2. $x(k)$ is a non-Gaussian, nth order white process with cumulant $\gamma_x \neq 0$,

3. $n_i(k)$ is zero-mean additive noise, with $n_i(k)$ independent of $n_j(k)$ for $i \neq j$ and independent of $x(k)$. No assumptions are made about the pdf or whiteness (in time) of $n_i(k)$.

We also assume that each impulse response $f_i(k)$ is stable with no zeros on the unit circle and that its Z transform $F_i(z)$ can be written as [8]

$$F_i(z) = A_i\, z^{-r_i}\, I_i(z^{-1})\, O_i(z) \qquad (5.65)$$

where the $A_i$ are gain constants, the $r_i$ are integer linear phase factors,

$$I_i(z^{-1}) = \frac{\prod_j (1 - a_{ij} z^{-1})}{\prod_j (1 - c_{ij} z^{-1})}$$

is the minimum phase component and

$$O_i(z) = \prod_j (1 - b_{ij} z)$$

is the maximum phase component, with zeros $a_{ij}$ and poles $c_{ij}$ inside and zeros $1/b_{ij}$ outside the unit circle (i.e. $|a_{ij}| < 1$, $|b_{ij}| < 1$, and $|c_{ij}| < 1$).


5.4.2  Relation  of  Cross-Tricepstrum  of  the  Linear  Filter  Output 

With the above assumptions, the nth-order cross-spectrum of the $y_i(k)$ can be written as

$$S_{y,1,2,\ldots,n}(z_1, z_2, \ldots, z_{n-1}) = \gamma_x F_1(z_1) F_2(z_2) \cdots F_{n-1}(z_{n-1})\, F_n\!\left( \prod_{i=1}^{n-1} z_i^{-1} \right) \qquad (5.66)$$

Taking the logarithm and performing the inverse Z-transform on both sides, we obtain after some algebra the following results:

$$c_{y,1,2,\ldots,n}(m_1, m_2, \ldots, m_{n-1}) = \begin{cases}
\ln \gamma_x, & m_1 = m_2 = \ldots = m_{n-1} = 0 \\
-(1/m_i) A_i(m_i), & m_i > 0,\ m_j = 0,\ j \neq i,\ i = 1, 2, \ldots, n-1 \\
(1/m_i) B_i(-m_i), & m_i < 0,\ m_j = 0,\ j \neq i,\ i = 1, 2, \ldots, n-1 \\
(1/m_n) A_n(-m_n), & m_1 = m_2 = \ldots = m_{n-1} = m_n < 0 \\
-(1/m_n) B_n(m_n), & m_1 = m_2 = \ldots = m_{n-1} = m_n > 0 \\
0, & \text{otherwise}
\end{cases} \qquad (5.67)$$

with

$$A_i(k) = \sum_j (a_{ij})^k - \sum_j (c_{ij})^k, \qquad B_i(k) = \sum_j (b_{ij})^k. \qquad (5.68)$$

This result means that the nth-order cross-cepstrum is non-zero on $n$ lines only in its domain and that on each of these lines we find the complex cepstrum of a zero-linear phase, scaled version of one of the $n$ impulse responses.

Now, to develop a least squares solution for the $A_i$ and $B_i$, we take first partial derivatives of the logarithm of (5.66), independently with respect to each of its variables, followed by inverse Z transforms. Letting $S_{y,1,2,\ldots,n}(m_1, m_2, \ldots, m_{n-1})$ denote the nth-order cross cumulants of the $y_i$, we get the following $n - 1$ equations relating the cross cumulants to the cepstral coefficients:

$$S_{y,1,2,\ldots,n}(m_1, m_2, \ldots, m_{n-1}) * \left( m_i\, c_{y,1,2,\ldots,n}(m_1, m_2, \ldots, m_{n-1}) \right) = m_i\, S_{y,1,2,\ldots,n}(m_1, m_2, \ldots, m_{n-1})$$

for $i = 1, 2, \ldots, n - 1$. Each equation involves an $(n - 1)$-dimensional convolution. However, plugging in (5.67) reduces each equation to a single summation:

$$\sum_{k=1}^{\infty} \Big[ A_i(k)\, S_{y,1,2,\ldots,n}(t_1, t_2, \ldots, t_{n-1}) - B_i(k)\, S_{y,1,2,\ldots,n}(u_1, u_2, \ldots, u_{n-1}) - A_n(k)\, S_{y,1,2,\ldots,n}(m_1 + k, m_2 + k, \ldots, m_{n-1} + k) + B_n(k)\, S_{y,1,2,\ldots,n}(m_1 - k, m_2 - k, \ldots, m_{n-1} - k) \Big] = -m_i\, S_{y,1,2,\ldots,n}(m_1, m_2, \ldots, m_{n-1}) \qquad (5.69)$$

where

$t_i = m_i - k$,
$u_i = m_i + k$,
$t_j = u_j = m_j$, $j \neq i$.

From equation (5.68) the terms of the sums in (5.69) decay, so we can truncate them to $p_i$ and $q_i$ for the terms involving $A_i$ and $B_i$ respectively (see [8]) and rewrite (5.69) as a finite dimensional vector dot product equation. Writing $M \ge p_i + q_i + p_n + q_n$ equations at $M$ points in the $(n - 1)$-dimensional domain of $S_{y,1,2,\ldots,n}$ we can form the overdetermined system

&„•£„  =7m  (5.70) 

5.4.3  Cross-TEA  (CTEA)  Algorithm 

In  this  section  we  describe  the  CTEA  algorithm  for  blind  equalization  of  QAM  signals  with  four 
receivers.  The  algorithm  has  two  stages  at  each  iteration: 

1.  Channel  identification  and  deconvolution 

2.  Combining  by  use  of  a  decision  rule 
Channel  Identification  and  Deconvolution 

Step  1.  Estimate  the  cross-cumulants  and  kurtoses  of  the  received  data  recursively. 

Step 2. Form the systems of equations (5.70) and solve each system in turn to get the cepstral coefficients for each channel. (The cepstral coefficients for channel four can be estimated from the solution of one of the three systems or an average of all three.)

Step 3. From the results of the previous step, estimate the forward and inverse channel impulse responses up to a desired length.

Step 4. From the estimated forward impulse response and the kurtoses, estimate the gains $\hat{A}_i$ for each channel.

Step 5. With the estimated inverse response, $\hat{f}_{i,inv}(k)$, and the estimated gain for each channel, deconvolve to estimate the input symbol as

$$\tilde{x}_i(j) = \frac{1}{\hat{A}_i}\, y_i(j) * \hat{f}_{i,inv}(j)$$

Combining  Decision  Rules 

As illustrated in Figure 5.1, from the four estimates $\tilde{x}_i(j)$ we need to form a single quantized decision $\hat{x}(j)$. We describe here an optimal combining rule in the case of a perfect equalizer, as well as three sub-optimal schemes: arithmetic mean, majority rule, and median (which for $n = 4$ channels is equivalent to the $\alpha$-trimmed mean with $\alpha = 1$).

Optimal  Decision  for  the  Perfect  Equalizer  [8] 

We  consider  the  following  assumptions: 

1. $x(k)$ is complex and uniformly distributed,

2. $u_i(k)$ is the perfect equalizer for $f_i(k)$, i.e. $f_i(k) * u_i(k) = \delta(k)$, and

3. $n_i(k)$ are zero-mean, complex Gaussian variables with known variance $\sigma_i^2$ and are independent across channels.

Since we will do symbol by symbol detection, we will drop the time index $k$ for simplicity. With these assumptions,

$$\tilde{x}_i = x + n_i * u_i = x + \tilde{n}_i.$$

Therefore, the conditional probability density of $\tilde{x}_i$ given $x$, $p(\tilde{x}_i | x)$, is complex Gaussian with mean $x$ and variance

$$\tilde{\sigma}_i^2 = \sigma_i^2 \sum_k |u_i(k)|^2.$$

Since the noise in each channel is independent, the maximum likelihood estimate $\hat{x}$ of $x$ given the four observations $\tilde{x}_i$ (assuming $x$ to be from a continuous distribution) is

$$\hat{x}_R = \frac{\sum_i \tilde{\sigma}_i^{-2}\, \tilde{x}_{i,R}}{\sum_i \tilde{\sigma}_i^{-2}}, \qquad \hat{x}_I = \frac{\sum_i \tilde{\sigma}_i^{-2}\, \tilde{x}_{i,I}}{\sum_i \tilde{\sigma}_i^{-2}}$$

where the subscripts $R$ and $I$ denote real and imaginary parts respectively. Note that if the noise has the same variance in all channels then this result reduces to the arithmetic mean. If, on the other hand, we assume that $x$ belongs to a known discrete set $V$ then we need to find $\hat{x} \in V$ which satisfies

$$\min_{x \in V} \sum_i \tilde{\sigma}_i^{-2}\, |\tilde{x}_i - x|^2$$

or equivalently

$$\min_{x \in V} \sum_i \tilde{\sigma}_i^{-2} \left( |x|^2 - 2 (x_R\, \tilde{x}_{i,R} + x_I\, \tilde{x}_{i,I}) \right).$$

Of course the assumptions of perfect equalization and known noise variance are not realistic in practice, so we describe below three sub-optimal combining rules which we tested in our simulations.
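The optimal combining rule for the perfect equalizer amounts to an inverse-variance weighted average, followed (for a discrete alphabet) by a weighted nearest-point search. The sketch below illustrates both under assumptions: the per-channel noise variances are taken as known, a QPSK alphabet stands in for the discrete set $V$, and all names and numerical values are hypothetical.

```python
def ml_combine(estimates, variances):
    """Inverse-variance weighted mean of complex soft estimates."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, estimates)) / total

def ml_decide(estimates, variances, constellation):
    """Constellation point minimizing sum_i |x_i - x|^2 / variance_i."""
    def cost(x):
        return sum(abs(xi - x) ** 2 / v for xi, v in zip(estimates, variances))
    return min(constellation, key=cost)

qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
soft = [0.9 + 1.2j, 1.1 + 0.8j, -0.2 + 1.0j, 1.0 + 1.0j]
equal = ml_combine(soft, [0.5, 0.5, 0.5, 0.5])        # same variance: plain average
weighted = ml_combine(soft, [0.1, 0.1, 10.0, 0.1])    # noisy third channel nearly ignored
decision = ml_decide(soft, [0.1, 0.1, 10.0, 0.1], qpsk)
```

With equal variances the weighted mean coincides with the arithmetic mean, which is the motivation for the first sub-optimal rule described next.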

Arithmetic  Mean 

Step 1. Form a soft decision statistic

$$\bar{x}(j) = \frac{1}{4} \sum_{i=1}^{4} \tilde{x}_i(j)$$

(If information is available about the relative quality of the channels then a weighted mean could be used.)

Step 2. Put $\bar{x}(j)$ through a decision device to get $\hat{x}(j)$.

Majority Rule

Step 1. Put each estimate through a decision device to form four decision statistics $\hat{x}_i(j)$.

Step 2. If there is a plurality among the $\hat{x}_i(j)$ in one region of the decision space then that is the decision. If there is a tie (all four different or two votes for each of two decisions) use a tie-breaking procedure. One method would be to pick the decision region that has the smallest average squared decision error. For example, if $\hat{x}_1(j) = \hat{x}_2(j) \neq \hat{x}_3(j) = \hat{x}_4(j)$:

Let $d_1 = \sum_{i=1}^{2} |\tilde{x}_i(j) - \hat{x}_i(j)|^2$

Let $d_2 = \sum_{i=3}^{4} |\tilde{x}_i(j) - \hat{x}_i(j)|^2$

Then choose $\hat{x}_1(j)$ if $d_1 < d_2$, and $\hat{x}_3(j)$ if $d_2 < d_1$.

Median

Step 1. Order the real and imaginary parts of the $\tilde{x}_i(j)$ separately.

Step 2. Set

$$\mathrm{REAL}\{\bar{x}(j)\} = \mathrm{median}\{\mathrm{REAL}\{\tilde{x}_i(j)\}\}$$

$$\mathrm{IMAG}\{\bar{x}(j)\} = \mathrm{median}\{\mathrm{IMAG}\{\tilde{x}_i(j)\}\}$$

Step 3. Put $\bar{x}(j)$ through the decision device to get $\hat{x}(j)$.
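The three sub-optimal combining rules can be sketched as follows. This is an illustration under assumptions, not the code used for the reported simulations: a QPSK decision device is assumed, the tie-break uses the average squared decision error of the estimates supporting each candidate, and all function names and sample values are hypothetical.

```python
from collections import Counter
from statistics import median

def decide(x):
    """QPSK decision device: nearest point of {+/-1 +/- 1j}."""
    return complex(1.0 if x.real >= 0 else -1.0, 1.0 if x.imag >= 0 else -1.0)

def combine_mean(soft):
    """Arithmetic mean of the soft estimates, then one hard decision."""
    return decide(sum(soft) / len(soft))

def combine_median(soft):
    """Median of real and imaginary parts separately, then one hard decision."""
    return decide(complex(median(x.real for x in soft), median(x.imag for x in soft)))

def combine_majority(soft):
    """Hard decision per channel, then plurality vote with an error-based tie-break."""
    hard = [decide(x) for x in soft]
    counts = Counter(hard).most_common()
    best = counts[0][1]
    tied = [sym for sym, c in counts if c == best]
    if len(tied) == 1:
        return tied[0]

    def avg_error(sym):
        # Average squared distance of the supporting estimates from their decision.
        errs = [abs(x - h) ** 2 for x, h in zip(soft, hard) if h == sym]
        return sum(errs) / len(errs)

    return min(tied, key=avg_error)

tie = [0.9 + 1.1j, 1.2 + 0.8j, -0.1 + 0.9j, -0.3 + 1.2j]       # two votes each
plurality = [1.1 + 0.9j, 0.8 + 1.2j, 0.9 + 1.0j, -1.0 - 1.0j]  # three against one
```

In the tie example the first two estimates lie much closer to their decision point than the last two, so the tie-break selects the first group's symbol.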

5.5  Computer  Simulations 

Computer simulations have been employed to compare the performance of the blind equalization algorithms. The performance metrics used are those in Section 2. The following issues are addressed.

5.5.1  TEA  vs.  Bussgang-type  Algorithms 

Figs. 5.2-5.4 show the performance of the TEA algorithm compared with that of Bussgang-type algorithms, such as the Godard, Benveniste-Goursat, and Stop-and-Go algorithms. We see that the TEA algorithm opens the eye much faster than the Bussgang-type algorithms. This performance improvement is achieved at the expense of larger computational complexity.

5.5.2  POTEA  vs.  TEA 

Figs. 5.5-5.6 show the performance of the POTEA algorithm compared with that of TEA. We see that the POTEA algorithm converges faster than the TEA algorithm. The performance improvement is achieved at the expense of a further increase in computational complexity.

5.5.3 CTEA vs. TEA

Figs. 5.7-5.8 show the performance of the CTEA algorithm compared with that of the TEA algorithm. We see that the CTEA algorithm converges faster than the TEA algorithm for some channels. The performance improvement is achieved at the expense of a further increase in computational complexity.

6 ALGORITHM WITH NONLINEARITY INSIDE THE EQUALIZATION FILTER

Still another class of blind equalization algorithms consists of those which use Volterra filters [9], [10] or neural networks [20], [26], [27]. This class of algorithms performs nonlinear operations inside the equalization filter. It is therefore also able to correctly extract the phase information of the unknown channel from its output only. In this section, we will concentrate on those algorithms based on neural networks.

6.1  Review  of  Equalization  Techniques  Based  on  Neural  Networks 

Equalization is a technique used to combat the intersymbol interference caused by nonideal
channels. Usually, equalizers are implemented using linear transversal filters [17], [18], [30],
[31]. However, when the unknown channel has deep spectral nulls or severe nonlinear
distortions, such as phase jitter and frequency offset, linear equalizers are not powerful enough
to compensate for all of these. That is why nonlinear filters, such as those implemented by
Volterra filters or neural networks, come in and play an important role.

Neural Networks (NNWs) are mathematical models of theorized mind and brain activities.
The fundamental idea of NNWs is to organize many simple identical processing elements into
layers to perform more sophisticated tasks. The properties of NNWs include massive parallelism,
high computation rates, great capability for nonlinear problems, continuous adaptation,
inherent fault tolerance, and ease of VLSI implementation. All these properties make NNWs
attractive for various applications. Several neural network based algorithms have been proposed
for equalization problems.

1  Multi-Layer  Perceptron 

The multi-layer perceptron (MLP) [39], [40] is one of the most widely used implementations
of NNWs. It comprises a number of nodes which are arranged in layers, as shown in Figure
6.1. A node receives a number of inputs $x_1, x_2, \ldots, x_n$, which are multiplied by a set
of weights $w_1, w_2, \ldots, w_n$, and the resulting values are summed. A constant $v$, known
as the node threshold, is added to this weighted sum of inputs, and the output of the node
is obtained by evaluating a nonlinear (sigmoid) function $f(\cdot)$, which is called the activation
function.
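
As a minimal sketch of the node computation just described (weighted sum, threshold, sigmoid), the following Python fragment may help. The function name `node_output` and the choice of `tanh` as the sigmoid are illustrative assumptions, not taken from the paper.

```python
import math

def node_output(inputs, weights, threshold):
    """Output of one MLP node: sigmoid of (weighted sum of inputs + threshold)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    # ASSUMPTION: tanh is used as the sigmoid; it maps any real input into (-1, 1).
    return math.tanh(weighted_sum + threshold)

# Three inputs, three weights, and a threshold chosen so the argument is about zero.
y = node_output([0.5, -1.0, 2.0], [0.2, 0.4, 0.1], threshold=0.1)
```

Any other bounded, monotone function could replace `tanh` here without changing the structure of the node.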

The architecture of a perceptron can be described by a sequence of integers $n_0, n_1, \ldots, n_k$,
where $n_0$ is the dimension of the input to the network, and the number of nodes in each
successive layer, ordered from input to output, is $n_1, n_2, \ldots, n_k$. In this notation, the MLP
produces a nonlinear mapping $g: R^{n_0} \to R^{n_k}$.

The connection coefficients of the MLP are updated iteratively by the back-propagation
(BP) algorithm with the following formula:

$(w_{i+1}, v_{i+1}) = (w_i, v_i) + \Delta_i$  (6.1)

and

$\Delta_i = -\eta \cdot \frac{\partial E}{\partial (w_i, v_i)} + \alpha \cdot \Delta_{i-1}.$  (6.2)
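
The iteration of Eqs. (6.1) and (6.2), a gradient step plus a momentum term that reuses the previous increment, can be sketched as follows. This is only an illustration under stated assumptions: the names `momentum_step`, `step_size`, and `momentum` are hypothetical, and the quadratic error $E(w) = w^2$ is a toy stand-in for the network error.

```python
def momentum_step(params, grads, prev_delta, step_size=0.1, momentum=0.9):
    """One update: delta = -step_size * gradient + momentum * previous delta."""
    delta = [-step_size * g + momentum * d for g, d in zip(grads, prev_delta)]
    new_params = [p + dl for p, dl in zip(params, delta)]
    return new_params, delta

# Toy example (ASSUMPTION): minimize E(w) = w^2, whose gradient is 2w.
params, delta = [1.0], [0.0]
for _ in range(200):
    grads = [2.0 * params[0]]
    params, delta = momentum_step(params, grads, delta)
```

With these settings the iterate oscillates around the minimum with a decaying amplitude, which is the typical behavior of a momentum term.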


2  Self-Organizing  Feature  Maps 

The self-organizing feature map (SOM) topology, introduced by Kohonen [26], [27],
consists of two layers of nodes, referred to as the input layer and the output layer, which are
fully connected with different connection weights. The inputs to the SOM can be any
continuous values, whereas each output-layer node represents a pattern class that the
input vector may belong to. That means the outputs of SOMs are discrete values, and
therefore the SOM is sometimes also referred to as a learning vector quantizer.

The SOM works iteratively as follows. First, find the set of connection coefficients $W_g$
which is the closest to the input vector $A_k$:

$\| A_k - W_g \| = \min_j \| A_k - W_j \|$  (6.3)


Second, perform the following quantization of the output-layer node:

$b_g = \begin{cases} 1, & \text{if } \| A_k - W_g \| = \min_j \| A_k - W_j \| \\ 0, & \text{otherwise} \end{cases}$  (6.4)


and then move $W_g$ closer to $A_k$ using the equation

$\Delta W_j = \begin{cases} \alpha(k)\,[A_k - W_j], & j = g \\ \beta(k)\,[A_k - W_j], & j \in N_r,\ j \neq g \\ 0, & j \notin N_r \end{cases}$  (6.5)

where $N_r$ is the topological neighborhood of the winning node $b_g$, which consists of $b_g$ itself
and its direct neighbors up to depth $1, 2, \ldots$, and $\alpha(k)$ and $\beta(k)$ are the learning rates
at time $k$.
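
The three SOM steps above (find the winner, quantize the output layer, move the winner and its neighbors toward the input) can be sketched as below. The sketch assumes the output nodes lie on a one-dimensional line, so that the neighborhood is "all nodes within `radius` index positions of the winner"; that layout, and the names `som_step`, `alpha`, `beta`, and `radius`, are assumptions made for illustration.

```python
def som_step(weights, a_k, alpha, beta, radius=1):
    """One SOM iteration; `weights` (a list of weight vectors) is updated in place."""
    # Squared Euclidean distance from the input vector to every weight vector.
    dists = [sum((a - w) ** 2 for a, w in zip(a_k, w_j)) for w_j in weights]
    g = dists.index(min(dists))                            # winning node, as in (6.3)
    b = [1 if j == g else 0 for j in range(len(weights))]  # output quantization, as in (6.4)
    for j, w_j in enumerate(weights):
        if j == g:
            rate = alpha          # winner moves with rate alpha
        elif abs(j - g) <= radius:
            rate = beta           # ASSUMPTION: neighbors on a 1-D line of nodes
        else:
            rate = 0.0            # nodes outside the neighborhood are unchanged
        weights[j] = [w + rate * (a - w) for a, w in zip(a_k, w_j)]
    return g, b

weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [9.0, 9.0]]
winner, b = som_step(weights, [1.2, 0.8], alpha=0.5, beta=0.25)
```

In practice the learning rates and the neighborhood radius are usually decreased as training proceeds; the sketch keeps them fixed for a single step.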

6.2  The  MLPs  Equalization  Algorithm  for  PAM  and  QAM  Signals 

The applications of MLPs in equalization problems have so far been limited to binary {0,1} or
bipolar {-1,1} valued data and real-valued channel models [11], [20], [49]. In this section, we
introduce for the first time a new implementation structure of the MLP which works well with
L-PAM ($L > 2$) and N-QAM ($N \geq 4$) signals.

Looking into an MLP structure, we find that it is the sigmoid function of the output
layer nodes that confines the network outputs to the range [-1, 1]. In our equalization problem,
the signals are equally spaced and symmetric with respect to either the origin of the
coordinate system or the x and y axes. Thus we can simply scale up the node function of the
output layer by a constant factor C which is large enough to cover our maximum
signal range, e.g., [-15, 15] for 16-PAM or 256-QAM signals. So, for the output layer, we have
[30], [40]

$f_M(x) = C \cdot f(x), \quad (C > 1)$  (6.6)

as the activation function. For the hidden layers, we still use the sigmoid function

(6.7)

The idea of adding another constant $a$ comes from the thought that a smaller $a$ (equivalently,
a lower slope in Figure 6.2) would avoid strong oscillation and, in turn, decrease the chance of
divergence in the course of weight adjustment.

For complex channel models and QAM signals, we use complex connection coefficients to
get the weighted sum, to which a complex threshold is added. Then the sigmoid functions of
the real and the imaginary parts of this sum are evaluated separately.
Again, for the output layer nodes, the outputs are multiplied by a constant C. Using the steepest
descent formula (Eqs. 6.1 and 6.2), we get the adaptation algorithm of our new MLP equalizer, which
is described in Table 6.1 [30], [40].
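
The complex node described above, with the sigmoid applied separately to the real and imaginary parts and an output scale factor C, might be sketched as follows. The names `split_sigmoid` and `complex_node`, the use of `tanh` as the sigmoid, and the value `scale=8.0` are assumptions for illustration only.

```python
import math

def split_sigmoid(z, scale=1.0):
    """Apply the sigmoid separately to the real and imaginary parts of z, then scale."""
    # ASSUMPTION: tanh stands in for the paper's sigmoid.
    return scale * complex(math.tanh(z.real), math.tanh(z.imag))

def complex_node(inputs, weights, threshold, scale=1.0):
    """Complex weighted sum plus complex threshold, followed by the split sigmoid."""
    s = sum(w * x for w, x in zip(weights, inputs)) + threshold
    return split_sigmoid(s, scale)

# An output-layer node uses scale = C > 1; a hidden-layer node would use scale = 1.
out = complex_node([1 + 1j, -1 + 1j], [0.5 - 0.5j, 0.25 + 0j], 0.1j, scale=8.0)
```

Because each part is squashed independently, the output always lies inside the square with corners at (±C, ±C), which is why C must exceed the largest symbol amplitude.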

Simulations were conducted to examine the performance of MLP equalizers. The equalizer is
implemented by the new MLP structure with only one output node. The input data to the
system, $x_t$, are assumed to be independent of each other. The delayed input sequence $x_{t-d}$, where
$d$ is channel dependent, is used as the training sequence. The performance of MLP equalizers is
evaluated by calculating the mean square error (MSE) $E[(\hat{x} - x)^2]$ and the average symbol error
rate (SER) of the quantizer output. The eye pattern of the equalizer outputs around a certain number
of iterations is shown in Figure 6.3.

Figure 6.4 illustrates the performance comparison between the MLP and an LMS-based linear
transversal equalizer with the same number of inputs. The structure (the number of nodes in the
hidden layer) of the MLP has been fine-tuned through experiment. The step size $\mu$ of the LMS-based
equalizer is also optimized (the largest value that does not cause divergence). From Fig. 6.4, it
appears that the new MLP structure does not work much better, as a channel equalizer, than the
simple linear adaptive equalizer. As a matter of fact, both methods end up giving similar results.

7 CONCLUSIONS


The purpose of this paper is to provide a tutorial review of existing blind equalization algorithms
for digital communications. Three families of techniques have been described, namely, the
Bussgang techniques, the polyspectra-based techniques, and methods based on nonlinear
equalization filters or neural networks. The complexity of the Bussgang techniques is
approximately $2N$ multiplications per iteration, where $N$ is the order of the linear equalization
filter. On the other hand, the polyspectra-based techniques require a number of multiplications
per iteration on the order of $N^3$. However, as has been demonstrated in the paper, the
polyspectra-based techniques achieve a significantly faster convergence rate than the Bussgang
techniques. Finally, it is pointed out in the paper that blind equalizers based on nonlinear
filters or neural networks are better suited for equalization of channels with nonlinear distortions.

Table 4.1 Nonlinear Functions of Bussgang Iterative Techniques.

$u(i) = [u_1(i), \ldots, u_N(i)]^T$    equalizer taps
$y(i) = [y(i), \ldots, y(i-N+1)]^T$    input to the equalizer (block of data)

At iteration $i$, $i = 1, 2, \ldots$:
    $\tilde{x}(i) = u^H(i)\, y(i)$
    $e(i) = g[\tilde{x}(i)] - \tilde{x}(i)$
    $u(i+1) = u(i) + \mu\, y(i)\, e^*(i)$

Algorithm                 Nonlinear function: $g[\tilde{x}(i)] =$
LMS (training mode)       $x(i)$ (linear)
Decision Directed Mode    $\hat{x}(i)$
Sato                      $\gamma\, \mathrm{csgn}[\tilde{x}(i)]$
Benveniste-Goursat        $\tilde{x}(i) + k_1 (\hat{x}(i) - \tilde{x}(i)) + k_2 |\hat{x}(i) - \tilde{x}(i)| \, (\gamma\, \mathrm{csgn}[\tilde{x}(i)] - \tilde{x}(i))$
Godard ($p = 2$)          $\frac{\tilde{x}(i)}{|\tilde{x}(i)|} \{ |\tilde{x}(i)| + R_p |\tilde{x}(i)|^{p-1} - |\tilde{x}(i)|^{2p-1} \}$
Stop-and-Go               $\tilde{x}(i) + \frac{1}{2} A (\hat{x}(i) - \tilde{x}(i)) + \frac{1}{2} B (\hat{x}(i) - \tilde{x}(i))^*$
                          $(A, B) = (2,0), (1,1), (1,-1)$ or $(0,0)$, depending
                          on the signs of the DD and Sato errors
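
The generic Bussgang iteration of Table 4.1 (filter output, error from a memoryless nonlinearity, tap update) can be sketched as below, using the Godard nonlinearity with $p = 2$ as the example. The function names, the step size, the three-tap equalizer, and the unit-modulus test symbol are all assumptions chosen for illustration; for $p = 2$ the Godard expression simplifies to $\tilde{x}(1 + R_2 - |\tilde{x}|^2)$.

```python
def godard_p2(x_tilde, r2):
    """Godard nonlinearity for p = 2: (x/|x|)(|x| + R2|x| - |x|^3) = x(1 + R2 - |x|^2)."""
    return x_tilde * (1.0 + r2 - abs(x_tilde) ** 2)

def bussgang_step(u, y_block, mu, g):
    """One Bussgang iteration: output u^H y, error g(x) - x, tap update with e*."""
    x_tilde = sum(u_k.conjugate() * y_k for u_k, y_k in zip(u, y_block))
    e = g(x_tilde) - x_tilde
    u_next = [u_k + mu * y_k * e.conjugate() for u_k, y_k in zip(u, y_block)]
    return u_next, x_tilde, e

# ASSUMPTION: a unit-modulus 4-QAM symbol, for which R2 = E|x|^4 / E|x|^2 = 1.
s = (1 + 1j) / 2 ** 0.5
y_block = [0j, s, 0j]
g = lambda x: godard_p2(x, 1.0)

# Centre tap already correct: the error is (numerically) zero and the taps stay put.
u_matched, _, e_matched = bussgang_step([0j, 1 + 0j, 0j], y_block, 0.01, g)
# Centre tap too large: the update shrinks it toward unit output modulus.
u_shrunk, _, _ = bussgang_step([0j, 1.5 + 0j, 0j], y_block, 0.01, g)
```

Swapping `g` for another row of the table (for example a decision device or the Sato function) changes the algorithm without touching `bussgang_step`.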

Table 4.2 Comparison of Computational Complexity

                        Godard    CRIMNO             Adaptive Weight CRIMNO (memory size M)
                                  (memory size M)    Version I         Version II
Real Multiplication     4N+5      4N+8M+5            MN+8M+4N+5        4N+10M+5

Table 6.1 Complex MLP adaptation algorithm.

1). Assign small random complex numbers to all the connections and thresholds.

2). Forward propagate inputs through the network:

    $h_{i+1,j} = \sum_{l=1}^{n_i} w_{i,j,l}\, a_{i,l} + v_{i+1,j} = h^R_{i+1,j} + \mathrm{j} \cdot h^I_{i+1,j}$

    $a_{i+1,j} = f(h^R_{i+1,j}) + \mathrm{j} \cdot f(h^I_{i+1,j})$

    where $i = 1, \ldots, M$ ($M$ is the number of layers), $f(\cdot)$ is the sigmoid
    function, and get the output,

    $\hat{x} = C \cdot a_M$.

3). Present the training signal to find the output error,

    $e_M = \epsilon^R_M [1 - (\hat{x}^R/C)^2]/C + \mathrm{j}\, \epsilon^I_M [1 - (\hat{x}^I/C)^2]/C$

    where $\epsilon_M = x_{t-d} - \hat{x}$.

4). Find the backpropagation error,

    $e_{i,j} = s^R_{i,j} \cdot [1 - (a^R_{i,j})^2] + \mathrm{j} \cdot s^I_{i,j} \cdot [1 - (a^I_{i,j})^2]$

    where

    $s_{i,j} = \sum_{l=1}^{n_{i+1}} w^*_{i,l,j}\, e_{i+1,l}$.

5). Adjust connections and thresholds:

    $w_{i,j,k}(n+1) = w_{i,j,k}(n) + \eta \cdot e_{i+1,j}(n) \cdot a^*_{i,k}(n)$

    $v_{i,j}(n+1) = v_{i,j}(n) + \beta \cdot e_{i,j}(n)$

    where $*$ denotes the conjugate operator. The momentum term can also
    be added.

6). Back to Step 2.

Figure 2.2 (a) The Bussgang Algorithms: Nonlinearity is in the Output of Equalization Filter.

Figure 2.2 (c) Blind Equalizers with Nonlinearity Inside the Equalization Filter.

Equalization Filter with Nonlinearity in the Output.

Figure 4.2 MS Estimate Under Uniform Distribution

Figure 4.3 MS Estimate Under Laplace Distribution

Figure 4.5 MAP Estimate Under Laplace Distribution

Figure 4.6 Comparison of the adaptive weight CRIMNO algorithm (sz1) with Godard's
algorithm of different step-size (sz3 is the optimum step-size): (a) the real
channel; (b) the synthetic channel.

Figure 4.7 Effect of memory size M on the adaptive weight CRIMNO algorithm
(channel 2): (a) Mean square error; (b) Probability of error; (c) Intersymbol
interference.

Figure 4.8 Eye pattern of adaptive weight CRIMNO algorithms with different memory size
M at iteration 20000: (a) Godard; (b) Adaptive weight CRIMNO (M=2); (c)
Adaptive weight CRIMNO (M=4); (d) Adaptive weight CRIMNO (M=6).

Figure 5.3 Channel G201 with QAM-64; Godard algorithms.

Figure 5.6 Performance comparison of the POTEA algorithm with the TEA algorithm.

MOMENTS, CUMULANTS AND SOME APPLICATIONS TO
STATIONARY RANDOM PROCESSES

BY DAVID R. BRILLINGER*

University of California, Berkeley

The  paper  ranges  over  some  basic  ideas  concerning  moments  and 
cumulants,  focusing  on  the  case  of  random  processes.  Uses  of  moments 
and  cumulants  in  developing  large  sample  approximate  distributions,  in 
system  identification  and  in  inferring  causal  connections  of  a  network  of 
point  processes  are  presented. 

1. Introduction. Moments and cumulants find many uses in mainstream
statistics generally and with random processes particularly.
Moments reflect the parameters of distributions and hence, as via the
method of moments, may be used to estimate distributional parameters.
Moments may be employed to develop approximations to the statistical
distributions of quantities, such as sums in central limit theorems and
associated expansions. Moments may be used to study the independence of
variates. Moments unify diverse random processes, such as point
processes and random fields, and diverse domains, such as the line or
space-time.

*Research partially supported by NSF Grant DMS-8900613.
AMS 1980 subject classifications. Primary 62M10, 62M99.
Key words and phrases. Coherence, cumulant, moment, partial coherence, point process,
system identification, time series.

2. Ordinary case. One can begin by asking: What is a moment? To
provide an answer to this question, consider the case of the 0-1 valued
variates X, Y, Z. For these variates

$E\{XYZ\} = \mathrm{Prob}\{X = 1, Y = 1, Z = 1\}$

This provides an interpretation for a (third-order) moment in terms of a
quantity having a primitive existence, namely a probability. Higher-order
moments have a similar interpretation. One can proceed to general random
variables by noting that these may be approximated by step (or simple)
functions, see e.g. Feller (1966), page 107.

Next one can ask: What is a cumulant? One answer is to say that it
is a combination of moments that vanishes when some subset of the variates
is independent of the others. Suppose for example that X is independent
of (Y, Z). The third-order joint cumulant may be defined by

$\mathrm{cum}\{X, Y, Z\} = E\{XYZ\} - E\{X\}E\{YZ\} - E\{Y\}E\{XZ\} - E\{Z\}E\{XY\} + 2E\{X\}E\{Y\}E\{Z\}$  (1)

By substitution one quickly sees that this last expression vanishes in the
case that X is independent of (Y, Z).
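
The vanishing of the third-order joint cumulant under independence can be checked numerically for 0-1 valued variates. The sketch below evaluates expression (1) exactly from a joint probability table; the function names and the particular probabilities are assumptions chosen for illustration.

```python
def expectation(pmf, func):
    """Expected value of func(x, y, z) under a joint pmf {(x, y, z): probability}."""
    return sum(p * func(x, y, z) for (x, y, z), p in pmf.items())

def third_cumulant(pmf):
    """Third-order joint cumulant cum{X, Y, Z} computed from expression (1)."""
    ex = expectation(pmf, lambda x, y, z: x)
    ey = expectation(pmf, lambda x, y, z: y)
    ez = expectation(pmf, lambda x, y, z: z)
    exy = expectation(pmf, lambda x, y, z: x * y)
    exz = expectation(pmf, lambda x, y, z: x * z)
    eyz = expectation(pmf, lambda x, y, z: y * z)
    exyz = expectation(pmf, lambda x, y, z: x * y * z)
    return exyz - ex * eyz - ey * exz - ez * exy + 2 * ex * ey * ez

# X independent of the (dependent) pair (Y, Z): the cumulant should vanish.
p_x = {0: 0.7, 1: 0.3}
p_yz = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
independent_pmf = {(x, y, z): p_x[x] * p_yz[(y, z)] for x in p_x for (y, z) in p_yz}

# X = Y = Z with Prob{X = 1} = 0.3: fully dependent, so the cumulant need not vanish.
dependent_pmf = {(0, 0, 0): 0.7, (1, 1, 1): 0.3}

k_independent = third_cumulant(independent_pmf)
k_dependent = third_cumulant(dependent_pmf)
```

In the dependent case the value is the third cumulant of a single 0-1 variate, $p(1-p)(1-2p)$ with $p = 0.3$.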

Expression (1) gives one definition of a joint cumulant. An alternate
way to proceed is to state that the cumulant is given by the coefficient of
$i^3 \alpha\beta\gamma$ in the Taylor expansion of

$\log E\{e^{i(\alpha X + \beta Y + \gamma Z)}\}$

supposing one exists.

Taking the log here converts factorizations into additivities and one sees
immediately why the joint cumulants vanish in the case of independence.

Streitberg  (1990)  sets  down  a  sequence  of  conditions  that  actually 
characterize  a  cumulant.  These  are: 

1. Symmetry

$\mathrm{cum}\{X_1, X_2, \ldots\} = \mathrm{cum}\{X_2, X_1, \ldots\}$

2. Multilinearity

$\mathrm{cum}\{aX_1, X_2, \ldots\} = a\, \mathrm{cum}\{X_1, X_2, \ldots\}$

$\mathrm{cum}\{X_1 + Y_1, X_2, \ldots\} = \mathrm{cum}\{X_1, \ldots\} + \mathrm{cum}\{Y_1, \ldots\}$

3. Moment property, if the moments of X and Y are identical up to order k

$\mathrm{cum}\{X\} = \mathrm{cum}\{Y\}$

4. Normalization, in the expansion in terms of moments

$\mathrm{cum}\{X_1, \ldots, X_k\} = E\{X_1 \cdots X_k\} + \cdots$

5. Interaction, if a subset is independent of the remainder

$\mathrm{cum}\{X_1, \ldots, X_k\} = 0$

Cumulants provide a measure of Gaussianity. If the variate X is normal,
then

$\mathrm{cum}_k\{X\} = 0$  (2)

for $k > 2$. (Here $\mathrm{cum}_k$ denotes the joint cumulant of X with itself k
times.) Putting (2) together with the fact that the normal distribution is
determined by its moments provides a particularly brief proof of the central
limit theorem. Namely suppose that $X_1, X_2, \ldots$ are independent
and identically distributed with $E\{X\} = 0$ and $\mathrm{var}\{X\} = 1$. Suppose all
moments exist for X. Consider

$S_n = (X_1 + \cdots + X_n)/\sqrt{n}$  (3)

Then

$\mathrm{cum}_k\{S_n\} = n\, \mathrm{cum}_k\{X\} / n^{k/2}$

which tends to 0 for $k > 2$, as n tends to infinity, and in consequence $S_n$
has a limiting normal distribution.
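
The scaling of the cumulants of the normalized sum can be verified exactly in a simple case. The sketch below assumes X takes the values ±1 with probability 1/2 each (mean 0, variance 1, fourth cumulant $1 - 3 = -2$) and computes the fourth cumulant of $S_n$ from the exact binomial distribution; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def fourth_cumulant_of_scaled_sum(n):
    """Exact fourth cumulant of S_n = (X_1 + ... + X_n)/sqrt(n) for X = +1 or -1."""
    # The unscaled sum equals 2B - n with B binomial(n, 1/2).
    probs = [Fraction(comb(n, b), 2 ** n) for b in range(n + 1)]
    m2 = sum(p * (2 * b - n) ** 2 for b, p in enumerate(probs))
    m4 = sum(p * (2 * b - n) ** 4 for b, p in enumerate(probs))
    # The sum has mean zero, so its fourth cumulant is m4 - 3*m2^2;
    # dividing the sum by sqrt(n) divides the fourth cumulant by n^2.
    return (m4 - 3 * m2 ** 2) / n ** 2

k4_n10 = fourth_cumulant_of_scaled_sum(10)
k4_n100 = fourth_cumulant_of_scaled_sum(100)
```

Both values agree with $n\,\mathrm{cum}_4\{X\}/n^{2} = -2/n$, which tends to 0 as n grows.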




An error bound may be given for the degree of approximation of the
distribution of a random variable by a normal, via bounds on the cumulants.
In Rudzkis et al. (1978) the following result is developed. Consider
a variate Y with mean 0 and variance 1. Suppose that

$|\mathrm{cum}_k\{Y\}| \le \frac{(k!)^{1+\nu} H}{\Delta^{k-2}}$

for some $\nu \ge 0$, $H \ge 1$, then in the interval $0 \le u < \delta/H$

$\sup_u |\mathrm{Prob}\{Y < u\} - \Phi(u)| \le \frac{18H}{\delta}$

where

$\delta = \frac{1}{6} \left( \frac{\sqrt{2}\,\Delta}{6} \right)^{1/(1+2\nu)}$

In the case of a sum, such as (3), one can take $\Delta = \sqrt{n}$ for example.


3. Time series case. Consider a stationary time series X(t) with
domain $t = 0, \pm 1, \pm 2, \ldots$. If the k-th moment exists, then from the
stationarity the moment function

$E\{X(t+u_1) \cdots X(t+u_{k-1}) X(t)\}$

will not depend on t, nor will the associated cumulant function

$c_k(u_1, \ldots, u_{k-1}) = \mathrm{cum}\{X(t+u_1), \ldots, X(t+u_{k-1}), X(t)\}$  (4)

The Fourier transforms of these $c_k(\cdot)$ give the higher-order spectra of the
series. These functions may be estimated given stretches of data.

It was indicated, by property 5 above, that a joint cumulant measures
statistical dependence. This suggests formalizing the intuitive notion that
values at a distance in time are not strongly dependent via

$\sum_{u_1} \cdots \sum_{u_{k-1}} |c_k(u_1, \ldots, u_{k-1})| < \infty$  (5)

for $k = 2, 3, \ldots$. It is now direct to provide a central limit theorem for
sums of values of a stationary time series. One has

$\mathrm{cum}_k\{\sum_t X(t)/\sqrt{T}\} = \sum_t \sum_{u_1} \cdots \sum_{u_{k-1}} c_k(u_1, \ldots, u_{k-1}) / T^{k/2}$

$\to \sum_u c_2(u)$ for $k = 2$

and

$\to 0$ for $k > 2$

following (5), giving the limit normal distribution.

Another aspect of the use of cumulants is that a calculus exists for
manipulating polynomials in basic variates. Suppose that

$Y = g(X_1, \ldots, X_L) = \sum_j \alpha_{j_1 \ldots j_L} X_1^{j_1} \cdots X_L^{j_L}$  (6)

One has directly from (6) that

$E\{Y^k\} = \sum_m \beta_{m_1 \ldots m_L} E\{X_1^{m_1} \cdots X_L^{m_L}\}$

but perhaps more usefully, there are rules due to Fisher, see Leonov and
Shiryaev (1959), Speed (1983), providing an expression

$\mathrm{cum}_k\{Y\} = \sum_\sigma \gamma_\sigma\, \mathrm{cum}\{X_j : j \in \sigma_1\} \cdots \mathrm{cum}\{X_j : j \in \sigma_p\}$

where $\sigma = (\sigma_1, \ldots, \sigma_p)$ is a partition of subscripts into blocks and the
$\gamma_\sigma$ are coefficients.

A time series analog of an expansion, like (6) for ordinary variates, is
provided by the Volterra expansion

$Y(t) = a_0 + \sum_u a_1(t-u) X(u) + \sum_{u_1, u_2} a_2(t-u_1, t-u_2) X(u_1) X(u_2) + \cdots$  (7)

Using the Cramér representation of the process, namely

$X(t) = \int e^{it\lambda}\, dZ_X(\lambda)$

(7) may be written

$a_0 + \int e^{it\lambda} A_1(\lambda)\, dZ_X(\lambda) + \int\int e^{it(\lambda_1 + \lambda_2)} A_2(\lambda_1, \lambda_2)\, dZ_X(\lambda_1)\, dZ_X(\lambda_2) + \cdots$

in terms of the Fourier transforms of the $a_1(\cdot), a_2(\cdot), \ldots$. This form
often simplifies the development of particular analytic results.
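
A truncated, finite-memory version of the Volterra expansion (7), keeping only the constant, linear, and quadratic terms, can be sketched as follows. The causal finite-memory kernels, the zero padding before the start of the data, and the name `volterra_output` are assumptions made for illustration.

```python
def volterra_output(x, a0, a1, a2):
    """Second-order Volterra filter with finite memory.

    a1[i] multiplies x[t-i]; a2[i][j] multiplies x[t-i] * x[t-j].
    ASSUMPTION: samples before the start of the record are taken as zero.
    """
    def past(t, lag):
        return x[t - lag] if t - lag >= 0 else 0.0

    y = []
    for t in range(len(x)):
        linear = sum(a1[i] * past(t, i) for i in range(len(a1)))
        quadratic = sum(a2[i][j] * past(t, i) * past(t, j)
                        for i in range(len(a2)) for j in range(len(a2[i])))
        y.append(a0 + linear + quadratic)
    return y

y = volterra_output([1.0, 2.0, 3.0], a0=0.5, a1=[1.0, 0.5],
                    a2=[[0.1, 0.0], [0.0, 0.2]])
```

Setting `a2` to all zeros reduces the sketch to an ordinary linear filter plus a constant.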

Consideration now turns to the use of moments and cumulants in the
identification of nonlinear systems. In the case of a polynomial system
like (7), Lee and Schetzen (1965) develop estimates of the functions
$a_1(\cdot), a_2(\cdot), \ldots$ via empirical moments of the form

$\frac{1}{T} \sum_{t=0}^{T-1} X(t+u_1) \cdots X(t+u_k) Y(t)$

for the case that the input, $X(\cdot)$, is Gaussian white noise.

For the case of stationary Gaussian input and a quadratic system

$Y(t) = a_0 + \sum_u a_1(t-u) X(u) + \sum_{u_1, u_2} a_2(t-u_1, t-u_2) X(u_1) X(u_2) + \text{noise}$

Tick (1961) developed an estimation procedure as follows. Define the
cross-spectrum and cross-bispectrum via

$\mathrm{cum}\{dZ_X(\lambda), dZ_Y(\mu)\} = \delta(\lambda + \mu)\, f_{XY}(\lambda)\, d\lambda\, d\mu$

$\mathrm{cum}\{dZ_X(\lambda_1), dZ_X(\lambda_2), dZ_Y(\lambda_3)\} = \delta(\lambda_1 + \lambda_2 + \lambda_3)\, f_{XXY}(\lambda_1, \lambda_2)\, d\lambda_1\, d\lambda_2\, d\lambda_3$

respectively. One has

$f_{YX}(\lambda) = A_1(\lambda) f_{XX}(\lambda)$

$f_{XXY}(-\lambda_1, -\lambda_2) = 2 A_2(-\lambda_1, -\lambda_2) f_{XX}(\lambda_1) f_{XX}(\lambda_2)$

relations from which estimates of the transfer functions, A, may be
developed, based on estimates of the spectra that appear.

Another system that may be identified, in a like manner, takes the
form, for input $X(\cdot)$ and output $Y(\cdot)$,

$U(t) = \sum_u a(t-u) X(u)$

$V(t) = G[U(t)]$

$Y(t) = \mu + \sum_u b(t-u) V(u) + \text{noise}$

i.e. involves an instantaneous nonlinearity, $G[\cdot]$, and two linear filters. In
the case that $X(\cdot)$ is stationary Gaussian, one can develop the relationships

$f_{YX}(\lambda) = L_1 A(\lambda) B(\lambda) f_{XX}(\lambda)$

$f_{XXY}(\lambda_1, \lambda_2) = L_2 A(-\lambda_1) A(-\lambda_2) B(-\lambda_1 - \lambda_2) f_{XX}(\lambda_1) f_{XX}(\lambda_2)$

where $L_1$, $L_2$ are constants. See Korenberg (1973) and Brillinger (1977).
Estimates of the identifiable unknowns may be developed based on estimates
of the spectra appearing.

4. Point process case. Consider isolated points, $x_k$, scattered along
the real line. Let $N(t)$ count the number in $(0, t]$ and $dN(t)$ the number in
the small interval $(t, t+dt]$. Typically $dN(t)$ will be 0 or 1.

The k-th order product density of the point process $N(\cdot)$ is $p_k(\cdot)$
given by

$E\{dN(t_1) \cdots dN(t_k)\} = \mathrm{Prob}\{dN(t_1) = 1, \ldots, dN(t_k) = 1\} = p_k(t_1, \ldots, t_k)\, dt_1 \cdots dt_k$

for $t_1, \ldots, t_k$ distinct and $k = 1, 2, \ldots$. This relates to the moments
of the process as follows. Write $N^{(k)} = N(N-1) \cdots (N-k+1)$, then the
k-th factorial moment of $N(t)$ is

$E\{N(t)^{(k)}\} = \int_0^t \cdots \int_0^t p_k(t_1, \ldots, t_k)\, dt_1 \cdots dt_k$

The corresponding cumulant density is given by

$\mathrm{cum}\{dN(t_1), \ldots, dN(t_k)\} = q_k(t_1, \ldots, t_k)\, dt_1 \cdots dt_k$

for $t_1, \ldots, t_k$ distinct. The k-th factorial cumulant of $N(t)$ is now

$\int_0^t \cdots \int_0^t q_k(t_1, \ldots, t_k)\, dt_1 \cdots dt_k$

In the case of a Poisson process, the product densities will be given by

$p_k(t_1, \ldots, t_k) = p(t_1) \cdots p(t_k)$

with $p(t)$ the intensity of the process and the cumulant densities will vanish
for $k > 1$.

As an example of the use of moments to derive an alternate limit
theorem, suppose one has $N_1(\cdot), \ldots, N_n(\cdot)$, i.i.d. copies of a point process
$N(\cdot)$. Suppose they are superposed and rescaled to form the point process

$M_n(t) = N_1(t/n) + \cdots + N_n(t/n)$

The k-th factorial cumulant of this process is

$n \int_0^{t/n} \cdots \int_0^{t/n} q_k(t_1, \ldots, t_k)\, dt_1 \cdots dt_k \approx n\, (t/n)^k\, q_k(0, \ldots, 0)$

for large n, assuming continuity at 0. This cumulant tends to $t\, q_1(0)$ for
$k = 1$ and to 0 for $k > 1$ and in consequence one has a Poisson limit for
the variate $M_n(t)$.

5. Extensions. The preceding results and definitions extend quite
directly to the cases of: a spatial process $X(x, y)$, a marked point process
$\sum_j M_j\, \delta(t - x_j)$, a hybrid process $X(x_j)$ and a line process, for example.

6. An example. In this section second-order moments and cumulants
are employed to infer the causal connections amongst some contemporaneous
point processes.

Consider the stationary bivariate point process (M, N) with points $x_k$
and $y_l$ respectively. In what follows an estimate of the product density of
order 2 will be needed. The parameter is defined via

$p_{MN}(u)\, du\, dt = E\{dM(t+u)\, dN(t)\} = \mathrm{Prob}\{dM(t+u) = 1,\ dN(t) = 1\}$

This last suggests basing an estimate on the count

$\#\{(k, l) : |x_k - y_l - u| < h/2\}$  (8)

for some small bin width h. Details are given in Brillinger (1976). One
result is that it appears more pertinent to graph the square root of the estimate.
In the case that the processes M and N are independent, one will
have $p_{MN}(u) = p_M p_N$, which possibility may be examined via the statistic
(8).
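
The count (8) is simple to compute directly from two lists of event times. The sketch below is a brute-force illustration; the function names are hypothetical, and the normalization by `h * T` (bin width times record length) is an assumption about how the count is turned into a density estimate, with no edge correction applied.

```python
def cross_count(x_times, y_times, u, h):
    """Number of pairs (x_k, y_l) with |x_k - y_l - u| < h/2, as in (8)."""
    return sum(1 for x in x_times for y in y_times if abs(x - y - u) < h / 2)

def cross_intensity_estimate(x_times, y_times, u, h, T):
    """Count divided by h*T. ASSUMPTION: this simple normalization, no edge correction."""
    return cross_count(x_times, y_times, u, h) / (h * T)

x_times = [1.0, 2.0, 5.0]
y_times = [0.5, 1.6, 4.0]
count = cross_count(x_times, y_times, u=0.5, h=0.25)
estimate = cross_intensity_estimate(x_times, y_times, u=0.5, h=0.25, T=6.0)
```

For long records the double loop would normally be replaced by a sorted sweep or a histogram of time differences, but the quantity computed is the same.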

The suggested estimate will be illustrated with some neurophysiological
data. Concern in the experiment was with auditory paths in the brain
of the cat. To collect data, microelectrodes were inserted with location
tuned to sound response. Data were recorded when the neurons were firing
spontaneously. Also responses were evoked experimentally by 200 msec
noise bursts that were applied every 1000 msec via speakers inserted in
the ears. The firing times of 8 neurons were recorded. Figure 1 provides
the data itself for 4 selected cells, 2 in the case with stimulation, 2 when
the firing is spontaneous. Each horizontal line plots firings as a function
of time since stimulus initiation in a 1000 msec time period. The
stimulus was applied 505 times in these examples. In the stimulated case
one notices vertical darkening corresponding to excess firing just after the
stimulus has been applied. Neurophysiologists speak of locking. In the
spontaneous case no locking is apparent. There is some evidence of
nonstationarity in this case.

Figure  2  provides  the  square  root  of  a  multiple  of  (8).  The  horizontal 
dashed  lines  are  ±2  standard  errors  about  a  horizontal  line  corresponding 
to  independence  in  the  stationary  case.  One  infers  that  the  cell  pairs  are 
associated  in  each  case.  However  in  the  stimulated  case  one  has  to  wonder 
if  the  apparent  association  of  units  6  and  7  is  not  due  to  the  fact  that  the 
cells  are  being  stimulated  at  the  same  times. 

Fourier techniques provide one means to address this concern. Write

$d_M^T(\lambda) = \sum_k e^{-i\lambda x_k} = \int_0^T e^{-i\lambda t}\, dM(t)$

for the data $0 < x_k < T$. For $\lambda \neq 0$ one has

$E\{d_M^T(\lambda)\, \overline{d_N^T(\lambda)}\} \approx 2\pi T f_{MN}(\lambda)$

with $f_{MN}(\cdot)$ the cross-spectrum given by

$f_{MN}(\lambda) = \frac{1}{2\pi} \int e^{-i\lambda u} q_{MN}(u)\, du$

A useful quantity for measuring the association of M and N may now be
defined. It is the coherence,

$|R_{MN}(\lambda)|^2 = |f_{MN}(\lambda)|^2 / (f_{MM}(\lambda) f_{NN}(\lambda))$

with the interpretation

$\lim_{T \to \infty} |\mathrm{corr}\{d_M^T(\lambda), d_N^T(\lambda)\}|^2$

It satisfies $0 \le |R_{MN}(\lambda)|^2 \le 1$, with greater association corresponding to
values nearer 1. Figure 3 provides coherence estimates for the cell pairs
of Figure 2. This evidence of association is in accord with that of Figure
2. The dashed horizontal line provides the 95% point of the null distribution
of the coherence estimate.

To return to the driving question of how to "remove" the effects of
the stimulus, one can consider the partial coherence. This has the interpretation

$\lim_{T \to \infty} |\mathrm{corr}\{d_M^T - \alpha\, d_S^T,\ d_N^T - \beta\, d_S^T\}|^2$

with $\alpha$, $\beta$ regression coefficients and S referring to the process of stimulus
times. Suppressing the dependence on $\lambda$ the partial coherence is given by
$|R_{MN|S}|^2$ where

$R_{MN|S} = \frac{R_{MN} - R_{MS} R_{SN}}{\sqrt{(1 - |R_{MS}|^2)(1 - |R_{SN}|^2)}}$

Figure 4 provides the estimated partial coherence of neurons 6 and 7 in
the stimulated case. The level apparent in the top graph of Figure 3 has
fallen off substantially, suggesting that the association evidenced in Figures
2 and 3 is due to the stimulus.
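
The partial coherence formula can be evaluated at a single frequency once the three pairwise (complex) coherencies are available. The sketch below assumes those coherencies have already been estimated; the function names and the numerical values are hypothetical.

```python
def partial_coherency(r_mn, r_ms, r_sn):
    """Complex partial coherency of M and N given S at one frequency."""
    numerator = r_mn - r_ms * r_sn
    denominator = ((1 - abs(r_ms) ** 2) * (1 - abs(r_sn) ** 2)) ** 0.5
    return numerator / denominator

def partial_coherence(r_mn, r_ms, r_sn):
    """Squared modulus of the partial coherency, a value between 0 and 1."""
    return abs(partial_coherency(r_mn, r_ms, r_sn)) ** 2

# Case 1 (ASSUMPTION): M and N are related only through S, so R_MN = R_MS * R_SN.
r_ms, r_sn = 0.8 + 0j, 0.6 + 0.2j
only_via_stimulus = partial_coherence(r_ms * r_sn, r_ms, r_sn)

# Case 2: neither process is related to S, so nothing is removed.
unrelated_to_stimulus = partial_coherence(0.5 + 0j, 0j, 0j)
```

In the first case the partial coherence is zero, mirroring the drop seen when the stimulus effect is "removed"; in the second it equals the ordinary coherence $|R_{MN}|^2$.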

For interest's sake Figure 5 provides the coherence estimate for neurons
3 and 4 in the case of applied stimulation. One might wonder if they
would become more strongly associated in the presence of stimulation.
The results do not suggest that this has happened.

7. Conclusions. In summary, moments and cumulants may be
employed to develop approximations to distributions, approximations such
as the normal or the Poisson. They may be employed in system
identification. They may be used to infer the "wiring" diagram of a collection
of interacting point processes.

The approach presented is nonparametric, not based on special stochastic
processes described by finite dimensional parameters. Brillinger
(1991) provides a variety of references concerning the pre-1980 work on
higher moments and spectra.

Acknowledgements.  The  neurophysiological  data  were  provided  by 
Alessandro  Villa.  Terry  Speed  mentioned  the  Streitberg  (1990)  result. 

REFERENCES 

BRILLINGER, D. R. (1976). Estimation of the second-order intensities of
a bivariate stationary point process. J. Roy. Statist. Soc. B 38, 60-66.

BRILLINGER, D. R. (1977). The identification of a particular nonlinear
time series system. Biometrika 64, 509-515.

BRILLINGER, D. R. (1991). Some history of the study of higher-order
moments and spectra. Statistica Sinica 1, 465-476.

BRILLINGER, D. R. (1992). Nerve cell spike train data analysis: a progression
of technique. J. Amer. Statist. Assoc. 87, June.

FELLER, W. (1966). An Introduction to Probability Theory and Its Applications,
Vol. II. New York, Wiley.

KORENBERG, M. J. (1973). Cross-correlation analysis of neural cascades.
In Proc. 10th Annual Rocky Mountain Biomed. Symp. 47-51.
Instrument Society of America, Pittsburgh.

LEE, Y. W. and SCHETZEN, M. (1965). Measurement of the Wiener kernels
of a nonlinear system by crosscorrelation. Internat. J. Control 2,
237-254.

LEONOV, V. P. and SHIRYAEV, A. N. (1959). On a method of calculation
of semi-invariants. Theor. Prob. Appl. 4, 319-329.

RUDZKIS, R., SAULIS, L. and STATULEVICIUS, V. (1978). A general
lemma on probabilities of large deviations. Liet. Mat. Rink. 18, 99-116.

SPEED, T. P. (1983). Cumulants and partition lattices. Austral. J. Statist.
25, 378-388.

STREITBERG, B. (1990). Lancaster interactions revisited. Ann. Statist.
18, 1878-1885.

TICK, L. (1961). The estimation of transfer functions of quadratic systems.
Technometrics 3, 563-567.

DEPARTMENT  OF  STATISTICS 
UNIVERSITY  OF  CALIFORNIA 
BERKELEY,  CA  94720 




Figure  Legends 

Figure 1. Raster plot of the firing times of 4 neurons in successive 1000 msec periods. There are 505 horizontal lines of firing times.

Figure  2.  The  square  root  of  a  multiple  of  the  quantity  (6).  Were  the 
processes  independent  and  stationary  then  about  5%  of  the  values  should 
lie  outside  the  band  defined  by  the  two  horizontal  dashed  lines. 

Figure  3.  Estimated  coherences  of  cells  6  and  7  in  the  stimulated  case  and 
3  and  4  when  the  firing  is  spontaneous. 

Figure  4.  Estimated  partial  coherence  of  cells  6  and  7  "removing"  the 
effect  of  the  stimulus. 

Figure 5. The estimated coherence of cells 3 and 4 in the case of stimulation.

Abstract

Consider finite mixture models of the form $g(x; Q) = \int f(x; \theta)\, dQ(\theta)$, where $f$ is a parametric density and $Q$ is a discrete probability measure. An important and difficult statistical problem concerns the determination of the number of support points (usually known as components) of $Q$ from a sample of observations from $g$. For an important class of exponential family models we have the following result: if $P$ has more than $p$ components, and $Q$ is an appropriately chosen $p$ component approximation of $P$, then $g(x; P) - g(x; Q)$ demonstrates a prescribed sign change behavior, as does the corresponding difference in the distribution functions. These strong structural properties have implications for diagnostic plots for the number of components in a finite mixture.


1  Introduction 

Consider a family of univariate probability densities with respect to some $\sigma$-finite measure $d\gamma(x)$, parameterized by $\theta \in \Omega$. Frequently, interest lies in mixtures of such densities. The random variable $X$ is said to have a mixture distribution $G(\cdot\,; Q)$ if $X$ has density

$$g(x; Q) = \int f(x; \theta)\, dQ(\theta), \tag{1}$$

and the mixing distribution $Q$ is a probability measure on $\Omega$. If $Q$ has a finite number of support points $\nu = \nu(Q)$ then we say $Q$ is a finite mixing distribution and we write $Q_\nu = \sum_j \pi_j \delta(\theta_j)$, with $\theta_1, \ldots, \theta_\nu$ being the support points (often called components) and $\pi_1, \ldots, \pi_\nu$ being the weights.

*The authors were supported by NSF grants DMS-910GS95 to Lindsay and DMS-9im 1-421 to Roeder.

A problem of longstanding interest in such models is inference on the unknown value of $\nu(Q)$. At the simplest level, this is the problem of determining if $\nu$ equals 1, the one component model, or is greater than 1, the multicomponent model. Shaked (1980) presented important results for this problem when the component densities $f(x; \theta)$ are one parameter exponential family. We extend his results in two directions, generalizing to the discrimination between $\nu = p$ versus $\nu > p$, and moving beyond the one parameter exponential family to the normal mixture model in which each component has a different mean, but the same unknown variance.

Here we summarize Shaked's sign crossings results. Suppose we wish to contrast a multicomponent model $g(x; Q)$ with a plausible one component model $f(x; \theta)$. Choose $\theta = \theta^*$ for the one component model so that the observed variable $X$ has the same mean under both densities:

$$\int x\, g(x; Q)\, d\gamma(x) = \int x\, f(x; \theta^*)\, d\gamma(x).$$

Our notation for this last equation will be $E[X; Q] = E[X; \theta^*]$. Shaked showed that $g(x; Q) - f(x; \theta^*)$ has exactly two sign changes, in the order $(+, -, +)$, as $x$ traverses the sample space. That is, $g(x; Q)$ has heavier tails than $f(x; \theta^*)$. Moreover, the difference in distribution functions $G(x; Q) - F(x; \theta^*)$ has exactly one sign change, in the order $(+, -)$.

We extend his results as follows: let $P$, the nominal true mixing distribution, satisfy $\nu(P) > p$; choose $Q_p$, a candidate $p$-point probability measure, such that it satisfies

$$E[X^k; P] = E[X^k; Q_p], \quad k = 0, 1, \ldots, 2p - 1. \tag{2}$$

(In Section 2 we show how to solve for $Q_p$.) Then, in Theorem 3.2, we show that $g(x; P) - g(x; Q_p)$ has exactly $2p$ sign changes in the order $(+, -, \ldots, -, +)$, unless it is identically zero (the case of nonidentifiable $P$). An exact sign change result for the difference in distribution functions is also given in Section 3. In Section 4, these results are extended to normal densities with unknown variance.

Before proceeding to the mathematical verification of these results, we offer a few brief comments on their potential application. In Figure 1, we plot $[g(x; P) - g(x; Q_2)]/\sqrt{g(x; P)}$ for the case when $f(x; \theta)$ is Poisson, $P$ puts mass 1/3 each at (1, 3 and 5), and $Q_2$ is constructed to match moments as specified in (2). We note the clear trimodality of this function, in contrast to the unimodality of the density $g(x; P)$ (Figure 2).

Shaked demonstrated that his sign change results could be used for diagnostic checks to determine if the data were from a mixture of specified exponential family densities rather than a one component model. These ideas were further developed in Lindsay and Roeder (1992). When interest lies in assessing the number of components in a finite mixture, the oscillation results obtained in this article have clear implications for diagnostic plots. In a companion paper these results are used to develop diagnostic plots for the case of normal mean mixtures with unknown variance (Roeder 1992).

2  Background 

2.1 The models under investigation

We will be interested in component densities $f(x; \theta)$ where both $x$ and $\theta$ have ranges in the real numbers, say $x \in T \subset R$ and $\theta \in [l, u] \subset R$. Furthermore, $f(\cdot\,; \cdot)$ satisfies regularity conditions which will be expounded in this subsection. Although the most important application of the results to follow is the one parameter exponential family, the results readily extend to other cases of interest for which we need the following terminology.

A real function of two variables, $K(x, \theta)$, ranging over linearly ordered sets $T$ and $\Omega$ is said to be totally positive (TP) if certain determinantal inequalities hold (Karlin 1968, pp. 11, 15). For instance, the functions $\exp(\theta x)$ and $I(x < \theta)$ are TP. In addition, many density functions occurring in statistical theory are TP. An example is the one parameter exponential family with density function $K(x; \theta) = \exp\{\theta x - \kappa(\theta)\}$. Other examples include the noncentral-$t$ and noncentral-$\chi^2$ densities. In fact, all of the densities mentioned above are strictly TP (STP; Karlin 1968, p. 12). For a more extensive list, see Karlin (1968, p. 117). We will say that $f(x; \theta)$ is an STP-model if $f(x; \theta)$ is strictly totally positive in $x$ and $\theta$.

2.2 Background on moments and exponential families

In order to apply our results in a particular model we need to establish two important structural features for the component densities $f(x; \theta)$. Our first requirement is as follows: suppose that $P$ is a mixing distribution with $p$ or more support points. Then we need to be able to construct a $p$-point distribution $Q_p$ such that the first $2p - 1$ moments of $g(x; P)$ and $g(x; Q_p)$ match, satisfying (2). Fortunately, there exists an important class of exponential families (the quadratic variance class) in which $Q_p$ satisfying (2) can be shown to exist. This class includes the normal, gamma, Poisson and binomial distributions. The following is a brief review of techniques found in Lindsay (1989).

In the quadratic variance family of exponential family models (Morris 1983), for each $k$, there exists a polynomial of degree $k$, call it $\xi_k(x)$, such that

$$\int \xi_k(x) f(x; \mu)\, d\gamma(x) = (\mu - \mu_0)^k \tag{3}$$

for mean value parameter $\mu$. The choice of $\mu_0$ is arbitrary so we set it to zero. For example, in the Poisson with mean $\mu$, $E[X] = \mu$, $E[X(X - 1)] = \mu^2$, $E[X(X - 1)(X - 2)] = \mu^3$ and so forth. In addition, a classical moment result indicates that for a given distribution $P$ with no fewer than $p$ points of support, there exists a unique distribution $Q_p$ with exactly $p$ points of support such that

$$\int \mu^k\, dQ_p(\mu) = \int \mu^k\, dP(\mu), \quad k = 1, \ldots, 2p - 1. \tag{4}$$

Thus integrating both sides of (3) with respect to $dQ_p(\mu)$ and $dP(\mu)$, and using (4) yields

$$E[\xi_k(X); P] = E[\xi_k(X); Q_p], \quad k = 1, \ldots, 2p - 1. \tag{5}$$

Finally, the map taking $(1, x, \ldots, x^{2p-1}) \to (\xi_0(x), \xi_1(x), \ldots, \xi_{2p-1}(x))$ is invertible, so (5) implies (2).

More details on solving (5) for $Q_p$ are given in Lindsay (1989). The solutions can be obtained algebraically for $p = 2$. For arbitrary $p$, the problem involves solving a degree $p$ polynomial for its $p$ real roots.
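For $p = 2$ the algebra can be written out directly. The following is a minimal sketch under stated assumptions, not the procedure of Lindsay (1989): it takes the first three moments $m_1, m_2, m_3$ of the mixing distribution as given (with $m_0 = 1$), obtains the quadratic whose roots are the two support points from a 2 x 2 Hankel system, and then solves for the weights. The function name `two_point_match` is hypothetical, and the example $P$ (mass 1/3 at 1, 3 and 5) is borrowed from the Poisson illustration of Section 1.

```python
import math

def two_point_match(m1, m2, m3):
    """Return (support, weights) of the two-point distribution whose
    first three moments are m1, m2, m3 (m0 = 1 is implied)."""
    # The support points are the roots of t^2 + c1*t + c0, where (c0, c1)
    # solves the Hankel system  c0*m_i + c1*m_{i+1} = -m_{i+2},  i = 0, 1.
    det = m2 - m1 * m1  # the variance; it must be positive
    if det <= 0:
        raise ValueError("moments do not admit two distinct support points")
    c0 = (m1 * m3 - m2 * m2) / det
    c1 = (m1 * m2 - m3) / det
    disc = c1 * c1 - 4.0 * c0
    if disc <= 0:
        raise ValueError("moment sequence is not valid for a two-point fit")
    root = math.sqrt(disc)
    t1, t2 = (-c1 - root) / 2.0, (-c1 + root) / 2.0
    w1 = (t2 - m1) / (t2 - t1)  # chosen so that the mean m1 is reproduced
    return (t1, t2), (w1, 1.0 - w1)

# Illustrative P: mass 1/3 on each of 1, 3 and 5.
pts = (1.0, 3.0, 5.0)
m1, m2, m3 = (sum(t ** k for t in pts) / 3.0 for k in (1, 2, 3))
support, weights = two_point_match(m1, m2, m3)
```

For this $P$ the sketch returns equal weights on the points $3 \pm \sqrt{8/3}$, which can be checked by hand because $P$ is symmetric about 3.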

3  One  parameter  models 

In this section we obtain sign change results for one parameter models. The following notation (Karlin 1968, p. 20) will be used. Let $a(x)$ be defined on $I$ where $I$ is a subset of the real line. The number of sign changes of $a$ in $I$ is defined by

$$S^-(a) = \sup S^-[a(x_1), \ldots, a(x_m)] \tag{6}$$

where $S^-(y_1, \ldots, y_m)$ is the number of sign changes of the indicated sequence, zero terms being discarded, and the supremum is extended over all sets

$$x_1 < x_2 < \ldots < x_m \quad (x_i \in I); \quad m < \infty. \tag{7}$$

We assume throughout that $f(x; \theta)$ is an STP kernel and that $P$ and $Q_p$ satisfy (2). The following notation will be used throughout this section: $g_1 = g(x; P)$, $g_2 = g(x; Q_p)$, $G_1 = G(x; P)$ and $G_2 = G(x; Q_p)$.

Remark In the following result we will give exact sign change results for $g_1 - g_2$ with the proviso "the difference $g_1 - g_2$ is not identically zero". If such an equality in densities occurs, it is clear that there is an identifiability problem; both $P$ and $Q_p$ are generating the same distribution. The results of Lindsay and Roeder (1992) can be used to determine exactly when this will occur. If the sample space is infinite, it will not occur. If the sample space has $N$ points, then $p$-point distributions $Q_p$ are identifiable when $p \le (N - 1)/2$, and so $g_1 - g_2$ cannot be identically zero. If both $P$ and $Q_p$ have more than $(N - 1)/2$ points, then $g_1 - g_2$ cannot have exactly $2p$ sign changes, since we can have at most $N - 1$ sign changes as we traverse the sample space. Thus our result proves that $P$ and $Q_p$ generate the same density. ■

Lemma 3.1 Provided $g_1 - g_2$ is not identically zero, $S^-(g_1 - g_2) \le 2p$.


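Proof Define the measure $d\lambda(\theta)$ by

$$d\lambda(\theta) = d(P + Q_p)(\theta).$$

Let

$$p^*(\theta) = \begin{cases} P(\{\theta\})/[P(\{\theta\}) + Q_p(\{\theta\})] & \text{if } \theta \in \{\theta_1, \ldots, \theta_p\} \\ 1 & \text{else,} \end{cases}$$

and

$$q^*(\theta) = \begin{cases} Q_p(\{\theta\})/[P(\{\theta\}) + Q_p(\{\theta\})] & \text{if } \theta \in \{\theta_1, \ldots, \theta_p\} \\ 0 & \text{else.} \end{cases}$$

Then $p^*$ and $q^*$ are versions of the Radon-Nikodym derivatives $dP/d\lambda$ and $dQ_p/d\lambda$, so that $g_1 - g_2 = \int f(x; \theta)[p^*(\theta) - q^*(\theta)]\, d(P + Q_p)(\theta)$.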

We now apply Theorem 3.1 (b) of Karlin (1968), noting that $p^*(\theta) - q^*(\theta)$ equals one except possibly at the support of $Q_p$, where it can be negative. Hence it undergoes a maximum of $2p$ sign changes. Karlin's result then implies that integration with respect to the STP kernel $f(x; \theta)$ will result in a function, $g_1 - g_2$, with no more sign changes in $x$ than $p^*(\theta) - q^*(\theta)$ has in $\theta$ relative to $d\lambda$. This establishes an upper bound of $2p$ sign changes in $g_1 - g_2$. ■

Theorem 3.2 Provided $g_1 - g_2$ is not identically zero, $S^-(g_1 - g_2) = 2p$, with sign changes in the order $(+, -, \ldots, -, +)$.

Proof From Lemma 3.1, we obtain an upper bound on the number of sign changes of $2p$. Because $\int x^k (g_1 - g_2)(x)\, d\gamma(x) = 0$ for $k = 0, 1, \ldots, 2p - 1$, any polynomial $A(x)$ of degree $\le 2p - 1$ satisfies

$$\int A(x)(g_1 - g_2)(x)\, d\gamma(x) = 0.$$

Suppose $S^-(g_1 - g_2) \le 2p - 1$. Then we can construct a polynomial $A(x)$ that matches $g_1 - g_2$ in sign (i.e., it has single roots exactly at the roots of $g_1 - g_2$). It follows that $A(x)(g_1 - g_2) \ge 0$, and since it has zero integral it must be zero except for a set of $\gamma$-measure zero. Hence either $g_1 = g_2$, or $g_1 - g_2$ has $2p$ sign changes. ■
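The sign pattern can be checked numerically in the Poisson case. The sketch below assumes the illustrative $P$ that puts mass 1/3 on each of 1, 3 and 5, and takes as its moment-matched two-point distribution $Q_2$ equal weights on $3 \pm \sqrt{8/3}$ (these values follow from matching the first three moments of $P$). The helper names are hypothetical, and the range of $x$ is truncated at 25 so that floating point error stays far below the size of the differences.

```python
import math

def poisson_pmf(x, mu):
    # Evaluated on the log scale to avoid overflow in mu**x and x!
    return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))

def mixture_pmf(x, support, weights):
    return sum(w * poisson_pmf(x, t) for t, w in zip(support, weights))

def sign_pattern(values, tol=1e-15):
    """Signs of a sequence, zero terms discarded and repeats collapsed."""
    pattern = []
    for v in values:
        if abs(v) <= tol:
            continue
        s = "+" if v > 0 else "-"
        if not pattern or pattern[-1] != s:
            pattern.append(s)
    return pattern

P = ((1.0, 3.0, 5.0), (1 / 3, 1 / 3, 1 / 3))
h = math.sqrt(8.0 / 3.0)
Q2 = ((3.0 - h, 3.0 + h), (0.5, 0.5))  # two-point fit matching 3 moments

diff = [mixture_pmf(x, *P) - mixture_pmf(x, *Q2) for x in range(26)]
pattern = sign_pattern(diff)
n_changes = len(pattern) - 1
```

With $p = 2$, the pattern obtained is $(+, -, +, -, +)$, that is, $2p = 4$ sign changes, in agreement with Theorem 3.2.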

Remark As is clear from the proof for this result, our oscillation results still hold if we replace $x^k$ in (2) with any system of functions $a_k(x)$, such as $x^k e^{-x}$, provided that one can construct a polynomial $A(x) = \sum_k c_k a_k(x)$ which has any prespecified set of $2p - 1$ zeroes. Such an approach could be useful in improving on the robustness of the sample moments in applications by using bounded variables such as $a_k(x) = x^k e^{-x}$. The next theorem, however, uses the special form of $x^k$. ■

Theorem 3.3 Provided $G_1 - G_2$ is not identically zero, $S^-(G_1 - G_2) = 2p - 1$, with sign changes in the order $(+, -, \ldots, +, -)$. The roots occur between the roots of $g_1 - g_2$.


Proof An upper bound is obtained on the number of sign changes by appealing to the sign change behavior of $g_1 - g_2$. The function $G_1 - G_2$ is increasing on the intervals $[a, b]$ where $g_1 - g_2 > 0$:

$$G_1(b) - G_2(b) - (G_1(a) - G_2(a)) = \int I[a < x \le b]\,(g_1 - g_2)(x)\, d\gamma(x) > 0.$$

From this it follows that $G_1 - G_2$ has at most one crossing in each interval where $g_1 - g_2$ is constant in sign, but has none in the first or last interval. Hence $S^-(G_1 - G_2) \le 2p - 1$. Integration by parts gives

$$0 = \int x\, d(G_1 - G_2)(x) = \int [G_2 - G_1](x)\, dx,$$

and more generally

$$0 = \int x^k\, d(G_1 - G_2)(x) = \int x^{k-1}[G_2 - G_1](x)\, dx,$$

up to $k = 2p - 1$. Now, follow the proof of Theorem 3.2. If $G_2 - G_1$ had $2p - 2$ or fewer sign changes, a polynomial $A(x)$ of degree $2p - 2$ could be constructed with matching signs. Hence $A(x)[G_2 - G_1](x) \ge 0$, but has zero integral. The result follows. ■
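The same kind of numerical check can be made for the distribution functions. This is a sketch under the same assumptions as the Poisson illustration of Section 1: $P$ puts mass 1/3 on each of 1, 3 and 5, and the two-point fit places equal weights on $3 \pm \sqrt{8/3}$. The helper names are hypothetical, and the grid stops at $x = 20$ because further out the difference of the two distribution functions becomes comparable to rounding error.

```python
import math
from itertools import accumulate

def poisson_pmf(x, mu):
    return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))

def mixture_pmf(x, support, weights):
    return sum(w * poisson_pmf(x, t) for t, w in zip(support, weights))

def sign_pattern(values, tol=1e-12):
    """Signs of a sequence, zero terms discarded and repeats collapsed."""
    pattern = []
    for v in values:
        if abs(v) <= tol:
            continue
        s = "+" if v > 0 else "-"
        if not pattern or pattern[-1] != s:
            pattern.append(s)
    return pattern

P = ((1.0, 3.0, 5.0), (1 / 3, 1 / 3, 1 / 3))
h = math.sqrt(8.0 / 3.0)
Q2 = ((3.0 - h, 3.0 + h), (0.5, 0.5))

xs = range(21)
G1 = list(accumulate(mixture_pmf(x, *P) for x in xs))   # G(x; P)
G2 = list(accumulate(mixture_pmf(x, *Q2) for x in xs))  # G(x; Q_2)
cdf_pattern = sign_pattern([a - b for a, b in zip(G1, G2)])
```

For $p = 2$ the resulting pattern is $(+, -, +, -)$, giving the $2p - 1 = 3$ sign changes stated in Theorem 3.3.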


For continuous $X$, a diagnostic plot based on a nonparametric empirical analog of $G_1 - G_2$ can be constructed directly. Let $F_n$, the empirical distribution function, be an estimate of the alleged distribution $G_1$ and let $\hat{G}_2$ be an estimate of $G_2$ constructed by using the method of moments estimates of the $p$-component model. Naturally, $F_n$ and $\hat{G}_2$ have $2p - 1$ moments in common. It follows that if $F_n - \hat{G}_2$ has the appropriate sign change behavior, then the data provide some support for using more than $p$ components. On the other hand, if a $p$-point mixture is the correct model, then the asymptotic properties of $F_n - \hat{G}_2$ can be obtained from empirical process theory.

4  Normal  Mean  Mixtures  with  Unspecified  Variance 

In this section we consider a mixture model of great interest: the normal mean mixture. We use the following notation: let $f(x; \mu, \tau)$ denote the density of a $N(\mu, \tau)$ random variable and let $g(x; Q, \tau) = \int f(x; \mu, \tau)\, dQ(\mu)$ denote a mixture of normals with corresponding distribution function $G(x; Q, \tau)$. If $\tau$ were known, then this is just a special case of the previous section; however, in practice, $\tau$ will typically be unknown and hence we treat it as a free parameter. In this section we extend our results to this case. We first present an existence theorem, due to Lindsay (1989), which extends the classic moment results presented in Section 2 to normal mixtures.

Theorem 4.1 If $Q$ is a distribution with more than $p$ points, then there exists a unique $p$-point distribution $Q_p$ and variance $\tau_p > \tau$ such that

$$\int x^k\, dG(x; Q_p, \tau_p) = \int x^k\, dG(x; Q, \tau) \quad \text{for } k = 0, 1, \ldots, 2p. \tag{8}$$

Proof While this is not explicitly stated in Lindsay (1989), it is a consequence of Lemma 5A and Theorem 5C. In the latter, replace the empirical moments with the moments of $X$ under $G(\cdot\,; Q, \tau)$. ■

Theorem 4.2 If $(Q_p, \tau_p)$ satisfies (8) for $Q = Q_{p+1}$, a $(p + 1)$-point distribution, then

$$g(x; Q_{p+1}, \tau) - g(x; Q_p, \tau_p)$$

has exactly $2p + 2$ sign changes, occurring in the order $(-, +, \ldots, +, -)$.

Proof Since $\tau_p > \tau$, we can represent the above difference as

$$g(x; Q, \tau) - g(x; Q^*, \tau)$$

where $Q^*$ is the convolution of $Q_p$ with a normal distribution with mean zero and variance $\tau_p - \tau$. By the same argument as in Lemma 3.1, this means there are a maximum of $2p + 2$ sign changes. The polynomial argument used in the proof of Theorem 3.2 can now be used together with (8) to show that there are at least $2p + 1$ sign changes. Moreover, since $Q^*$ has more mass in the tails than the discrete $Q_{p+1}$, the difference $g(x; Q, \tau) - g(x; Q^*, \tau)$ will have a negative sign in both tails, and so must have an even number of sign changes, hence $2p + 2$. ■

Theorem 4.3 $G(x; Q, \tau) - G(x; Q_p, \tau_p)$ has exactly $2p + 1$ sign changes, in the order $(-, +, \ldots, -, +)$.

Proof A similar argument to Theorem 3.3. ■


Graphical techniques, such as the normal scores plot (Harding 1948, Cassie 1954) and the modified percentile plot (Fowlkes 1979), have played an important role in identifying whether data follow a mixture of two normal distributions. The geometric characterizations obtained herein extend the arsenal of potential diagnostic plots for normal mixtures.

5 Discussion


Our results above, in the normal case, indicate that

$$g(x; Q_2, \tau) - g(x; \mu, \sigma^2)$$

has 4 sign changes in the order $(-, +, -, +, -)$ provided $\mu$ is the mean of $Q_2$ and $\sigma^2 = Var(X) = \tau + Var(Q_2)$. For this case a supplementary result is available from Roeder (1992). If we instead examine the ratio $R(x) = g(x; Q_2, \tau)/g(x; \mu, \sigma^2)$, we obtain a function proportional to a bimodal normal density. By combining the two results we can see that $R(x)$ is bimodal and that both modes are greater than 1.
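A small numerical check of these statements can be sketched as follows. The parameter values are assumptions chosen only for illustration: $Q_2$ puts weight 1/2 on each of $-0.5$ and $0.5$, and $\tau = 1$, so that $\mu = 0$ and $\sigma^2 = 1.25$. The code counts local maxima on a grid for the mixture density and for the ratio $R(x)$, and records the sign pattern of the difference of densities. The helper names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Illustrative values (assumptions, not taken from the paper).
mu1, mu2, tau = -0.5, 0.5, 1.0
w1 = w2 = 0.5
mu = w1 * mu1 + w2 * mu2                                      # mean of Q_2
sigma2 = tau + w1 * (mu1 - mu) ** 2 + w2 * (mu2 - mu) ** 2    # tau + Var(Q_2)

def g_mix(x):
    return w1 * normal_pdf(x, mu1, tau) + w2 * normal_pdf(x, mu2, tau)

def ratio(x):
    return g_mix(x) / normal_pdf(x, mu, sigma2)

grid = [i / 100.0 for i in range(-600, 601)]

def local_maxima(f):
    vals = [f(x) for x in grid]
    return [grid[i] for i in range(1, len(vals) - 1)
            if vals[i - 1] < vals[i] > vals[i + 1]]

def sign_pattern(values):
    pattern = []
    for v in values:
        if v == 0.0:
            continue
        s = "+" if v > 0 else "-"
        if not pattern or pattern[-1] != s:
            pattern.append(s)
    return pattern

density_modes = local_maxima(g_mix)
ratio_modes = local_maxima(ratio)
diff_pattern = sign_pattern([g_mix(x) - normal_pdf(x, mu, sigma2) for x in grid])
```

For these values the mixture density has a single mode at 0, while $R(x)$ has two modes, near $\pm 1.78$, at which it exceeds 1; the difference of densities shows the pattern $(-, +, -, +, -)$.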

In the normal model, with $\pi_1 = \pi_2 = 1/2$, the density $g(x; Q_2, \tau)$ is bimodal if and only if the two separate supports $\mu_1$ and $\mu_2$ satisfy $|\mu_1 - \mu_2| > 2\sqrt{\tau}$ (Robertson and Fryer 1969). Thus the ratio function is much more sensitive to the existence of two support points than is the density itself. This sensitivity continues to exist even for very small support weights $\pi_i$.

Abstract


The  normal  distribution  has  long  been  the  usual  model  for  the  analysis  of  multivariate  data. 
Moment  and  probability  calculations  for  the  multivariate  normal  are  used  in  applications  such 
as  the  construction  of  confidence  sets,  the  assessment  of  error  rates  in  signal  processing,  and  the 
construction  of  optimal  quantizers.  Recently,  the  family  of  elliptically  contoured  distributions, 
which  includes  the  normal,  has  been  extensively  studied.  In  this  paper,  we  discuss  moment  and 
probability  calculations  for  this  broader  class,  paying  particular  attention  to  the  approximation 
of tail probabilities.

1 Introduction


The  normal  distribution  has  long  been  the  usual  model  for  the  analysis  of  multivariate 
data.  Moment  and  probability  calculations  for  the  multivariate  normal  have  therefore  been  well 
studied  for  various  cases  of  interest.  In  statistics,  a  common  application  of  such  quantities  is 
the  construction  of  confidence  sets  for  parameters  of  the  normal  distribution.  Other  examples 
include  the  assessment  of  error  rate  probabilities  in  signal  processing,  the  construction  of  optimal 
quantizers for a Gaussian process, and the computation of a high order correlation coefficient of
the outputs from a zero-memory non-linear device with Gaussian inputs.

The general problem is still intractable, owing to the great difficulty in evaluating high dimensional integrals, but advances in computing technology and recent research have yielded innovative Monte Carlo and numerical integration techniques. These advances have widened the scope of such investigations to include other multivariate distributions. For instance, there are the elliptically contoured distributions and the multivariate Pearson family of distributions, both of which include the multivariate normal. Elliptically contoured distributions, in particular, have been extensively developed: see the collection of papers about them that was recently edited by Anderson and Fang [2].

In  this  paper,  we  study  the  computation  of  probabilities  and  moments  for  certain  elliptically 
contoured  distributions,  and  discuss  their  applications.  There  are,  of  course,  many  classes  of 
events  whose  probabilities  are  of  interest,  and  many  functions  whose  expectations  are  of  interest. 
Our  focus  will  be  on  the  evaluation  of  tail  probabilities,  and  on  methods  for  computing  product 
moments,  and  other  non-linear  functions  of  the  components  of  the  random  vector.  In  Section  2, 
we introduce elliptically contoured distributions, and describe their properties. Historically, moment methods have been associated with Pearson's family of distributions. Since some elliptically contoured distributions are also natural multivariate versions of some of Pearson's distributions, we briefly describe this connection also. In Section 3, we discuss applications of tail probabilities, and describe methods for approximating them accurately. These methods include Monte Carlo with importance sampling, and asymptotic approximations that generalize Mills' ratio for the normal distribution. In Section 4, we turn to moment calculations for elliptically contoured distributions using one of three tools: the characteristic function, a stochastic representation, and a certain partial differential equation satisfied by sufficiently smooth elliptically contoured densities.

2  Elliptically  Contoured  Distributions  and  Pearson  Families 

A $p$-dimensional vector $X$ has an elliptically contoured distribution if there is a non-negative definite matrix $\Sigma = (\sigma_{ij})$ such that the characteristic function of $X$ is $\phi(t) = e^{it'\mu}\psi(t'\Sigma t)$, where $\psi$ is a real-valued function on $R_+ = [0, \infty)$. Then $X$ has the stochastic representation

$$X = \mu + r\,\Sigma^{1/2} U_p, \tag{1}$$

where $\mu$ is the center of symmetry, the radial part $r$ is a non-negative random variable, and $U_p$ is uniformly distributed on $\Omega_p$, the surface of the unit sphere in $p$ dimensions; $r$ and $U_p$ are independent. The matrix $\Sigma^{1/2}$ is a square root of $\Sigma$: for computations, it is convenient to take $\Sigma^{1/2}$ to be the lower triangular matrix from the Cholesky decomposition, or the non-negative definite symmetric square root derived from the spectral representation of $\Sigma$. When $X$ has a density $f$, it is of the form

$$f(x; \mu, \Sigma) = |\Sigma|^{-1/2} g(Q), \tag{2}$$

where $Q = Q(x, \mu, \Sigma) = (x - \mu)'\Sigma^{-1}(x - \mu)$, $g : R_+ \to R_+$,

$$a_p \int_0^\infty r^{p-1} g(r^2)\, dr = 1, \tag{3}$$

and $a_p$ is the area of $\Omega_p$. The level curves of $f$ are ellipses determined by $\{x : Q = c\}$. In this case, $r$ has the density $h_r(r) = a_p r^{p-1} g(r^2)$. Examples of elliptically contoured distributions include the normal, for which $g(r) = \psi(r) = e^{-r/2}$, and the $p$-variate $t$ distribution with $\nu$ degrees of freedom, for which

$$f_{p,\nu}(x; \mu, \Sigma) = \frac{\Gamma((\nu + p)/2)}{(\pi\nu)^{p/2}\,\Gamma(\nu/2)}\, |\Sigma|^{-1/2} \left(1 + Q/\nu\right)^{-(\nu + p)/2}. \tag{4}$$


Another example is due to Iyengar [12] (see also [15]):

$$f_{p,k}(x; \mu, \Sigma, \eta) = \frac{\Gamma(p/2)}{\Gamma(k + p/2)}\, |\eta\pi\Sigma|^{-1/2}\, (Q/\eta)^k \exp(-Q/\eta), \tag{5}$$

where $\eta > 0$ and $k \ge 0$. When $k = 0$, (5) yields the normal distribution. For the bivariate case, Kotz [20] has also studied this family. The uniform distribution on $\Omega_p$ is yet another example which will be used for moment calculations below; it does not have a density. For further discussion of elliptically contoured distributions, see Anderson and Fang [2], Das Gupta, et al. [8], and Cambanis, et al. [5].
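The stochastic representation (1) translates directly into a sampling recipe. The sketch below is an illustration under stated assumptions rather than a method taken from the paper: it uses the lower triangular Cholesky factor as $\Sigma^{1/2}$, draws $U_p$ by normalizing a vector of independent standard normals, and, for the normal case, draws the radial part with $r^2$ distributed as chi-square with $p$ degrees of freedom. The function names and the values of $\mu$ and $\Sigma$ are hypothetical.

```python
import math
import random

def cholesky(S):
    """Lower triangular L with L L' = S, for symmetric positive definite S."""
    p = len(S)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def uniform_on_sphere(p, rng):
    # A normalized standard normal vector is uniform on the unit sphere.
    z = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in z))
    return [v / norm for v in z]

def chi_radius(p):
    # Radial part for the normal case: r^2 is chi-square with p d.f.
    return lambda rng: math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p)))

def make_sampler(mu, sigma, draw_radius, rng):
    """Return a function drawing X = mu + r * L * U as in (1)."""
    L = cholesky(sigma)
    p = len(mu)
    def draw():
        r = draw_radius(rng)
        u = uniform_on_sphere(p, rng)
        return [mu[i] + r * sum(L[i][k] * u[k] for k in range(i + 1))
                for i in range(p)]
    return draw

rng = random.Random(0)
mu = [1.0, -1.0]
sigma = [[2.0, 1.0], [1.0, 2.0]]
draw = make_sampler(mu, sigma, chi_radius(2), rng)
draws = [draw() for _ in range(20000)]
```

Other members of the family can be sampled by replacing `chi_radius` with a generator for the appropriate radial density $h_r$; with the chi radius the sample covariance of `draws` should be close to $\Sigma$.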

In one dimension, Pearson's family of distributions is defined by the following differential equation satisfied by their densities (see Cramer [7]):

$$\frac{d \log f(x)}{dx} = \frac{x + a}{b_0 + b_1 x + b_2 x^2}. \tag{6}$$

Within this family, the first four moments determine the distribution. Several types of Pearson distributions (depending on $a$, $b_0$, $b_1$, and $b_2$) have been identified. In addition to the normal, the common types are the beta (Type II), gamma (Type III), and Student's $t$ (Type VII). The elliptically contoured distributions given by (4) and (5) are multivariate versions of Types VII and III, respectively. For example, when $\Sigma = I$ and $\mu = 0$, the density for the $p$-variate $t$
distribution with $\nu$ degrees of freedom satisfies the following differential equation:

$$\frac{\partial \log f(x)}{\partial x_i} = -\frac{(\nu + p)\, x_i}{\nu + x'x}, \quad i = 1, \ldots, p. \tag{7}$$

However, there is an important difference between (4) and (5). For (5), if $\mu = 0$ and $k > 0$, then the density at the origin is 0, and the modal value, or peak, of the density occurs on the surface of the ellipsoid $\{x : x'\Sigma^{-1}x = k\eta\}$. On the other hand, the density in (4) has its peak at the origin, and it is unimodal. Several results that apply to the normal and (4) do not generalize to (5); see Tong [34] for further details.

3  Tail  Probabilities 

If $X$ is a random variable with density $f$ and cumulative distribution function $F$, the tail probability of $X$ refers to

$$\theta = 1 - F(a) = \int_a^\infty f(x)\, dx \tag{8}$$

for  large  values  of  a.  In  many  statistical  applications,  such  as  hypothesis  testing,  the  tail 
probability of interest is around 0.05. For such cases, the computation of, say, p-values is usually
straightforward.  In  other  applications,  especially  in  engineering,  much  smaller  probabilities  are 
of  interest.  For  instance,  in  signal  processing,  the  tail  probability  arises  as  the  error  rate  of 
a  complex  communications  system  (Scharf  [30],  Wessel,  et  al.  [35]);  and  in  reliability  theory, 
it  arises  as  the  failure  rate  of  a  system  component  (Lawless  [22]).  Often  such  systems  have 
redundancies  built  into  them,  so  that  their  error  or  failure  rates  are  very  low.  A  simple  model 
of  failure  regards  X  as  an  overall  index  of  stress,  and  considers  very  large  values  of  the  failure 
threshold,  a. 

In this formulation of the problem, two difficulties arise. First, the usual quadrature rules and Monte Carlo methods for evaluating $\theta$ are not sufficiently accurate, so specialized methods are needed for evaluating tail probabilities. We will turn to some of these methods below. Next, the basis for the choice of probabilistic model (that is, $F$) is tenuous. This is because for a complex system, the theoretical derivation of $F$ based on individual component characteristics is intractable; also, data to estimate $\theta$ are sparse since the event of interest is rare. While information about the central region (near the mean or median) of $F$ is usually available, the tail behavior is usually unknown, so extrapolation is necessary. One way of addressing this problem is to consider a wide range of plausible models for the tail behavior to derive a range of values for the tail probability. For one example of just such an approach, see Lavine [21], who studied shuttle O-ring data.

Multivariate versions of this problem arise in similar fashion: for instance, a system with two components may fail when each component's stress exceeds its respective threshold, leading to the failure probability $P(X_1 > a_1, X_2 > a_2)$. A number of new difficulties also arise. First, multiple integration is still a hard problem in general, so with few exceptions multivariate tail probabilities are not well studied. Also, a tail region can take on many shapes, for example, $\{x : x_1 > a_1, x_2 > a_2\}$, $\{x : a_1 x_1 + a_2 x_2 > a\}$, or $\{x : x_1^2 + x_2^2 > a^2\}$. Below, we restrict attention to convex regions that are far from the center of the distribution, eliminating the last example from consideration.

There are two main sources of error in assessing tail probabilities. The first is numerical: it is generally hard to evaluate a small quantity with small relative error. For a deterministic method, if $\tilde{\theta}$ is an approximation to $\theta$, the relative error is $(\tilde{\theta} - \theta)/\theta$. For a Monte Carlo method, the coefficient of variation (the ratio of the standard deviation to the mean of an estimator) is a measure of the relative error. If the unbiased estimator $\hat{\theta}_n$ of $\theta$ is an average of $n$ independent replicates, its squared coefficient of variation (cv$^2$) is

$$\mathrm{cv}^2(\hat{\theta}_n) = \frac{\mathrm{var}(\hat{\theta}_n)}{\theta^2} = \frac{1}{n}\left[\frac{E(\hat{\theta}_1^2)}{\theta^2} - 1\right]. \tag{9}$$


Below, we study the use of Monte Carlo with importance sampling to derive estimators for
which the cv² is small. If $B$ is a tail region, and $f$ is the density, importance sampling uses the
expression

$$ \theta = \int_B \frac{f(x)}{g(x)}\, g(x)\,dx = \int_B l(x)\, g(x)\,dx, \tag{10} $$

for some "sampling density" $g$ to get an unbiased estimator which is the average of $n$ independent
replicates (over the set $B$) of the likelihood ratio $l(Y)$, where $Y$ has density $g$. We seek those $g$
for which the cv² is bounded as the tail probability tends to zero.

The  second  source  of  error  is  statistical:  the  uncertainty  in  the  choice  of  the  model  F  makes 
the  tail  probability  estimate  uncertain,  even  if  there  were  no  numerical  error.  There  are  several 
ways  to  address  this  issue.  One  is  to  introduce  a  plausible  family  of  models,  and  compute  a 
range  of  tail  probabilities  for  that  family.  Another  is  to  follow  the  approach  of  Johnstone  [19], 
for  the  Pearson  family.  He  estimates  the  parameters  of  the  family  from  available  data,  and 
then  provides  an  estimate  of  a  given  quantile  with  its  standard  error.  Yet  another  approach  is 
Bayesian:  first  model  the  uncertainty  in  F  by  putting  a  prior  on  it,  and  then  use  available  data 
to  compute  the  posterior  distribution  of  the  tail  probability. 

We start with the univariate case to motivate the multivariate case below. If $X$ has density
$f$, l'Hôpital's rule says that with suitable regularity, the asymptotic behavior of $P(X > a)/f(a)$
is the same as that of $r(a) = -f(a)/f'(a)$. The regularity conditions are that $f'(t) \ne 0$ for all
sufficiently large $t$, and that the ratio $r(a)$ have a limit as $a \to \infty$; these conditions are met in
many cases of interest. Writing

$$ \int_a^\infty f(x)\,dx = r(a) f(a) \int_0^\infty \frac{f(a + r(a)x)}{f(a)}\,dx, \tag{11} $$

it is clear that (under the same regularity conditions) the last integral in (11) approaches 1 as
$a \to \infty$; thus, it is bounded away from 0, and estimating it with good relative accuracy can
be done using importance sampling. This heuristic has been extended by Gray and Wang [11],
where the generalized jackknife is used for evaluating univariate tail probabilities. The method
suggested below may be regarded as a Monte Carlo analog of that procedure.

For the normal distribution, (11) yields


so that the relative error in approximating $\Phi(-a)$ by $\phi(a)/a$ decreases to zero as $a$ increases.
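
As a minimal sketch of how (11) can be used for the standard normal, assume $r(a) = 1/a$ (which follows from $\phi'(a) = -a\phi(a)$), so that $\Phi(-a) = (\phi(a)/a)\int_0^\infty \exp(-x - x^2/2a^2)\,dx$. The remaining integral is the expectation of $\exp(-T^2/2a^2)$ for a standard exponential $T$, which is what the code averages. The function names are illustrative only and are not taken from [17].

```python
import math
import random


def normal_pdf(a):
    """Standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def normal_tail_is(a, n=10000, seed=0):
    """Importance-sampling estimate of P(Z > a) for standard normal Z, a > 0.

    Returns (estimate, estimated squared coefficient of variation of the
    n-replicate average).
    """
    rng = random.Random(seed)
    lead = normal_pdf(a) / a          # r(a) * f(a) with r(a) = 1/a
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        t = rng.expovariate(1.0)      # standard exponential replicate
        w = lead * math.exp(-t * t / (2.0 * a * a))
        total += w
        total_sq += w * w
    mean = total / n
    var_one = max(total_sq / n - mean * mean, 0.0)
    return mean, var_one / (n * mean * mean)


def normal_tail_exact(a):
    """Reference value from the complementary error function."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))
```

Comparing `normal_tail_is(a)` with `normal_tail_exact(a)` for increasing `a` gives a quick check that the relative accuracy does not deteriorate as the tail probability shrinks.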

The phenomenon observed in (16) is quite general: for a wide class of problems, the coefficient
of variation actually tends to zero, hence the relative accuracy improves as the threshold $a$
increases. In addition, this method is feasible since the calculation of $r(a)$ depends on the
differentiation of the density rather than its integration; since the behavior of the tail probability
is already captured by $r(a)f(a)$, the evaluation of the remaining integral by Monte Carlo provides
a correction term. In practice, either (11) or one of the following two expressions for $\theta$ is
useful:

$$ \theta = r(a) f(a) \int_0^\infty \frac{f(a + x/a)}{a\, r(a) f(a)}\,dx, \tag{17} $$

$$ \theta = r(a) f(a) \int_0^\infty \frac{a\, f(a + ax)}{r(a) f(a)}\,dx. \tag{18} $$


Two other examples illustrate this technique. The first involves the generalized inverse
Gaussian distribution, whose density is

$$ f(t \mid \alpha, \beta, \lambda) = \frac{(\alpha/\beta)^{\lambda/2}}{2K_\lambda\big((\alpha\beta)^{1/2}\big)}\, t^{\lambda-1} \exp\left[-\tfrac{1}{2}\left(\alpha t + \beta/t\right)\right], \quad \text{for } t > 0, \tag{19} $$

where $K_\lambda$ is the modified Bessel function of the third kind with index $\lambda$. The parameter space
is the union of the following three sets: $\{\alpha > 0, \beta > 0\}$, $\{\alpha = 0, \beta > 0, \lambda < 0\}$, and
$\{\alpha > 0, \beta = 0, \lambda > 0\}$. This family includes the gamma, the inverse Gaussian, the hyperbola distribution,
and their reciprocals, in the sense that if $X$ has density $f(t \mid \alpha, \beta, \lambda)$, then $X^{-1}$ has density
$f(t \mid \beta, \alpha, -\lambda)$. For the case $\alpha > 0$, $\beta > 0$, this method yields the estimator

$$ \hat\theta = \frac{2}{\alpha}\, f(a \mid \alpha, \beta, \lambda)\, e^{\beta/(2a)} \left(1 + \frac{2T}{\alpha a}\right)^{\lambda-1} \exp\left[-\frac{\alpha\beta}{2(\alpha a + 2T)}\right] \tag{20} $$

for sufficiently large $a$, where $T$ has a standard exponential density. The second example is the
$t$ distribution with $k$ degrees of freedom, with density $f_k(x)$ proportional to $(1 + x^2/k)^{-(k+1)/2}$,
for which the estimator is

$$ \hat\theta = \frac{a}{k}\, f_k(a) \left[\frac{(k + a^2)\,Y^2}{k + a^2 Y^2}\right]^{(k+1)/2}, \tag{21} $$

where $Y$ has the Pareto density $k/y^{k+1}$ for $y > 1$. In both cases, the cv² decreases to zero as
$a \to \infty$. Detailed proofs of these and related results are given in [17].
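
The Pareto-based estimator for the $t$ distribution can be sketched as follows. This is an illustration under stated assumptions, not the code of [17]: the density is normalized with the usual constant $\Gamma((k+1)/2)/(\sqrt{k\pi}\,\Gamma(k/2))$, Pareto variates are generated by inverting a uniform, and the closed form used as a reference applies only to $k = 2$. All function names are hypothetical.

```python
import math
import random


def t_pdf(x, k):
    """Student t density with k degrees of freedom."""
    c = math.gamma((k + 1) / 2.0) / (math.sqrt(k * math.pi) * math.gamma(k / 2.0))
    return c * (1.0 + x * x / k) ** (-(k + 1) / 2.0)


def t_tail_is(a, k, n=10000, seed=0):
    """Estimate P(X > a), a > 0, by averaging
    (a/k) f_k(a) [ (k + a^2) Y^2 / (k + a^2 Y^2) ]^((k+1)/2)
    over Pareto replicates Y with density k / y^(k+1) on y >= 1."""
    rng = random.Random(seed)
    lead = a * t_pdf(a, k) / k
    total = 0.0
    for _ in range(n):
        u = 1.0 - rng.random()        # uniform on (0, 1], avoids u == 0
        y = u ** (-1.0 / k)           # inverse-CDF draw from the Pareto density
        ratio = (k + a * a) * y * y / (k + a * a * y * y)
        total += lead * ratio ** ((k + 1) / 2.0)
    return total / n


def t2_tail_exact(a):
    """Closed-form P(X > a) for 2 degrees of freedom (reference value only)."""
    return 0.5 - a / (2.0 * math.sqrt(2.0 + a * a))
```

Because the bracketed ratio lies between 1 and $(k + a^2)/a^2$, every replicate is within a small factor of the leading term $(a/k) f_k(a)$ once $a$ is large, which is why the relative error stays small in the far tail.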

We now turn to the multivariate case. In 1962, Slepian [32] proved the following inequality.
Let $X \sim N_p(0, \Sigma = (\sigma_{ij}))$ and $Y \sim N_p(0, T = (\tau_{ij}))$ with $\sigma_{ij} \ge \tau_{ij}$ and $\sigma_{ii} = \tau_{ii}$; then for any
vector $a$, $P(X > a) \ge P(Y > a)$, where $x > a$ means that $x_i > a_i$ for all $i$. Slepian derived
this result using Plackett's identity (see Section 4 below) in a study of one-sided boundary
crossing problems for Gaussian processes. Since Slepian proved his inequality, his result has
been generalized in a number of ways. For instance, the inequality holds for all elliptically
contoured distributions: see Das Gupta, et al. [8] and Tong [34] for such results.

When $\sigma_{ij} \ge 0$ for all $i$ and $j$, the inequality $P_\Sigma(X > a) \ge P_I(X > a)$ yields a lower bound
which can be easily computed for the normal, since then it is a product of univariate normal
probabilities. However, this lower bound often gives a poor approximation (see Iyengar [14]), so
that Slepian's inequality is more useful for theoretical investigations. Thus, in this section, we
describe alternative methods that provide good approximations.

Suppose that $X$ is a $p$-variate vector which has an elliptically contoured distribution with
density $|\Sigma|^{-1/2} g(x'\Sigma^{-1}x)$; further, let the term "tail region" refer to a closed convex region $B$
that is far from 0 (of course, $B$ should have non-empty interior, else the probability will be
zero). If $\Sigma = LL'$ is the Cholesky decomposition of $\Sigma$, then $Z = L^{-1}X$ has the density
$f(z) = f(z; 0, I) = g(z'z)$, and $P(X \in B) = P(Z \in A = L^{-1}B)$. Since $A$ is closed and
convex, it contains a unique point, $\alpha$, that is closest to the origin: $|\alpha| \le |z|$ for $z \in A$, and $A$ is
contained in the half plane $\{z : z'\alpha \ge \alpha'\alpha\}$. Since $Z$ has a spherically symmetric distribution,
$A$ can be rotated so that $\alpha = re_1$, where $e_1$ is the unit vector in the $z_1$ direction, and $r = |\alpha|$.
Note that $r = r(A)$ depends upon the set $A$; for notational convenience, this dependence will be
suppressed. Next, if $\beta = L\alpha$, then $\beta$ minimizes the Mahalanobis distance, $(x'\Sigma^{-1}x)^{1/2}$, of points
in $B$ to the origin; also, $B$ is contained in the half plane $\{x : x'\Sigma^{-1}\beta \ge \beta'\Sigma^{-1}\beta\}$. Of course,
the problem of finding $\beta$ is a quadratic programming problem which can be solved using known
techniques. For any set $A$, matrix $D$, and vector $c$, let $DA + c$ denote the set $\{Dx + c : x \in A\}$.

To estimate $\theta = P(Z \in A)$, ordinary Monte Carlo averages $n$ independent replicates of
$I(Z \in A)$, where $I$ is an indicator function. This estimator's variance is $(\theta - \theta^2)/n$. An
alternative approach is to use $f(z - \alpha)$ as a sampling function (Wessel, et al. [35] refer to this
as improved importance sampling). The expression

$$ \theta = \int_A \frac{f(z)}{f(z - \alpha)}\, f(z - \alpha)\,dz = \int_{A - \alpha} \frac{f(z + \alpha)}{f(z)}\, f(z)\,dz \tag{22} $$

suggests the unbiased estimator

$$ \hat\theta = \frac{f(Z + \alpha)}{f(Z)}\, I(Z \in A - \alpha). \tag{23} $$

If $g$ is a decreasing function (that is, $f$ is unimodal, as is the case with the $p$-variate normal
or $t$, but not the family given in (5)), then $f(z) \le f(z - \alpha)$ for $z \in A$, and

$$ E(\hat\theta^2) = \int_A \frac{f(z)^2}{f(z - \alpha)}\,dz \le \theta, \tag{24} $$

so that $\hat\theta$ has a smaller variance (and smaller cv²) than ordinary Monte Carlo. However, it can
be shown that for several cases (the normal and the $t$), the cv² tends to infinity as $|\alpha| \to \infty$ (see
[17]). Thus, we turn to multivariate analogs of the method described in (12) above.
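
The growth of the cv² for the shifted sampling density can be seen in the simplest case. The sketch below assumes one dimension, a standard normal $f$, and $A = [a, \infty)$; under these assumptions a short calculation (ours, not from the source) gives the single-replicate second moment $\int_a^\infty \phi(z)^2/\phi(z - a)\,dz = e^{a^2}\Phi(-2a)$. The function names are illustrative.

```python
import math


def upper_tail(a):
    """P(Z > a) for a standard normal Z."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))


def cv2_crude(a):
    """Single-replicate cv^2 of the indicator estimator: (theta - theta^2) / theta^2."""
    theta = upper_tail(a)
    return (theta - theta * theta) / (theta * theta)


def cv2_shifted(a):
    """Single-replicate cv^2 when sampling from the density shifted to a,
    using the closed-form second moment exp(a^2) * P(Z > 2a)."""
    theta = upper_tail(a)
    second_moment = math.exp(a * a) * upper_tail(2.0 * a)
    return second_moment / (theta * theta) - 1.0


if __name__ == "__main__":
    for a in (2.0, 4.0, 6.0):
        print(a, cv2_crude(a), cv2_shifted(a))
```

In this special case the shifted sampler is far better than crude Monte Carlo at every threshold, yet its cv² still increases slowly with $a$, which is the behavior that motivates the estimators developed next.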

Although a direct generalization of (12) is not available, the analog is to write $A_0 = A - \alpha$,
and

$$ \theta = \int_A f(z)\,dz = f(\alpha) \int_{A_0} \frac{f(z + \alpha)}{f(\alpha)}\,dz, \tag{25} $$

and to manipulate the ratio $f(z + \alpha)/f(\alpha)$ to derive an estimator that has bounded cv² as the
region $A$ moves outward to infinity. Just as in the one-dimensional case, there is no generic
method that will work for all $g$; and unlike the one-dimensional case, the shape of $A$ (or
equivalently the shape of $B$ and the dependence among the random variables as given by $\Sigma$) plays
an important role in the choice of sampling function. We now sketch the details for the normal
and $t$ distributions.

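For the normal with density $\phi_p(z) = \phi_p(z; 0, I)$, (25) becomes

$$ \theta = \phi_p(\alpha) \int_{A_0} \frac{\phi_p(z + \alpha)}{\phi_p(\alpha)}\,dz = \frac{(2\pi)^{(p-1)/2}\phi_p(\alpha)}{|\alpha|} \int_{A_0} |\alpha|\, e^{-|\alpha| z_1 - z_1^2/2}\, \phi_{p-1}(u)\,du\,dz_1, \tag{26} $$

where $u = (z_2, \ldots, z_p)$. Next, for the $t$ density $f_p(z) = f_{p,\nu}(z; 0, I)$, a slight modification of (12)
is needed. Let $A_1 = A/|\alpha|$ to get

$$ \theta = f_p(\alpha)\, |\alpha|^p \int_{A_1} \left[\frac{\nu + |\alpha|^2}{\nu + |\alpha|^2 |z|^2}\right]^{(p+\nu)/2} dz. \tag{27} $$

Now using the sampling density which is proportional to $|z|^{-(p+\nu)}$ on $A_1$, we get

$$ \theta = f_p(\alpha)\, |\alpha|^p \int_{A_1} \left[\frac{(\nu + |\alpha|^2)\,|z|^2}{\nu + |\alpha|^2 |z|^2}\right]^{(p+\nu)/2} |z|^{-(p+\nu)}\,dz. \tag{28} $$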

Such expressions provide guidelines on the nature of the sampling function to use for
importance sampling. The specific choice depends, as mentioned before, on the nature of $A$,
specifically, on the shape of $A_0$ near the origin (or $A_1$ near the point $e_1$). In particular, let
$B = \{x : x_1 > b_1, x_2 > b_2\}$, where the $b_i$ are positive; without loss of generality, suppose that
$b_1 \le b_2$. When the correlation between $X_1$ and $X_2$ is $\rho$, the point, $\beta$, that is closest to the origin
(using Mahalanobis distance) is

$$ \beta = \begin{cases} (b_1, b_2) & \text{if } \rho < b_1/b_2 \\ (\rho b_2, b_2) & \text{if } \rho \ge b_1/b_2. \end{cases} \tag{29} $$

Transforming to the independent case and rotating so that the nearest point, $\alpha$, is in the $e_1$
direction gives

$$ \alpha = \begin{cases} ([b'R^{-1}b]^{1/2}, 0) & \text{if } \rho < b_1/b_2 \\ (b_2, 0) & \text{if } \rho \ge b_1/b_2. \end{cases} \tag{30} $$

The region $A$ is given in Figures 1 for $\rho < b_1/b_2$, and 2 for $\rho \ge b_1/b_2$. Since the nature of
$A_0 = A - \alpha$ at the origin is determined by the difference $\rho - b_1/b_2$, the ratio $b_1/b_2$ will be
preserved in the calculations above: in effect, the region $B$ will be moved outward towards
infinity in the direction of the vector $b = (b_1, b_2)$.
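
For the bivariate quadrant the quadratic program has the closed-form solution just described, so a small helper suffices. The sketch below assumes unit variances, $0 < b_1 \le b_2$, and $|\rho| < 1$; the function name is hypothetical.

```python
import math


def nearest_point(b1, b2, rho):
    """Point of {x1 >= b1, x2 >= b2} closest to the origin in Mahalanobis
    distance for a standard bivariate normal with correlation rho.

    Returns (beta, r), where r is the Mahalanobis distance of beta."""
    if not (0.0 < b1 <= b2 and abs(rho) < 1.0):
        raise ValueError("need 0 < b1 <= b2 and |rho| < 1")
    if rho < b1 / b2:
        beta = (b1, b2)               # the vertex is the nearest point
    else:
        beta = (rho * b2, b2)         # nearest point lies on the edge x2 = b2
    q = (beta[0] ** 2 - 2.0 * rho * beta[0] * beta[1] + beta[1] ** 2) / (1.0 - rho * rho)
    return beta, math.sqrt(q)
```

For example, `nearest_point(1.0, 2.0, 0.95)` falls in the high-correlation case, where the nearest point is no longer the vertex of the quadrant.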


{FIGURES HERE}

We will now provide some of the details for the normal distribution; for a fuller account,
see [17]. When the correlation coefficient $\rho$ is not large ($\rho < b_1/b_2$ when $b_1 \le b_2$), the bivariate
sampling function consists of a product of two exponential densities, and when $\rho$ is large, the
sampling function consists of the product of an exponential and a normal. This is intuitively
plausible, since for small $\rho$, the bivariate normal density is not far from the independent case,
while for large $\rho$, it is not far from the singular case, for which the exponential given in (13)
yields accurate estimates. Transforming back to $X$ (with $\rho_{12} = \rho$), the estimators are given by
the following. For $\rho < b_1/b_2$,

$$ \hat\theta = \frac{\phi_2(b; R)\,(1 - \rho^2)^2}{(b_1 - \rho b_2)(b_2 - \rho b_1)}\, e^{-T'R^{-1}T/2}, \tag{31} $$

where $T = (T_1, T_2)$ has independent exponentially distributed components with mean vector
$\big((1 - \rho^2)/(b_1 - \rho b_2),\ (1 - \rho^2)/(b_2 - \rho b_1)\big)$. And for $\rho \ge b_1/b_2$ it is

$$ \hat\theta = \frac{\phi(b_2)}{b_2}\, e^{-T^2/2}\, I[(T, U) \in A_0], \tag{32} $$

where $T$ and $U$ are independent with densities $|\alpha| e^{-|\alpha| t}$ and $\phi(u)$, respectively, and $A_0 =
A - (b_2, 0)$ is the translate of the set given in Figure 2. For both of these cases, it can be shown
that the cv² for the estimators given above all tend to zero as $|\alpha| \to \infty$, that is, as the tail
probability diminishes. The proof for the normal case is given in [17]. We omit the proof for
the $t$ distribution. Instead, we turn to the key quantity that is used in the proofs, Mills' ratio.
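
The low-correlation estimator, a product of two exponentials weighted by $e^{-T'R^{-1}T/2}$, can be sketched as below. This is an illustration under stated assumptions rather than the implementation of [17]: unit variances, positive thresholds, and $\rho < \min(b_1, b_2)/\max(b_1, b_2)$ so that both exponential rates $(b_1 - \rho b_2)/(1 - \rho^2)$ and $(b_2 - \rho b_1)/(1 - \rho^2)$ are positive. Function names are hypothetical.

```python
import math
import random


def bvn_density(x1, x2, rho):
    """Standard bivariate normal density with correlation rho."""
    s = 1.0 - rho * rho
    q = (x1 * x1 - 2.0 * rho * x1 * x2 + x2 * x2) / s
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(s))


def quadrant_tail(b1, b2, rho, n=20000, seed=0):
    """Estimate P(X1 > b1, X2 > b2) in the low-correlation case.

    Returns (estimate, estimated cv^2 of the n-replicate average)."""
    if min(b1, b2) <= 0.0 or abs(rho) >= 1.0:
        raise ValueError("need positive thresholds and |rho| < 1")
    if not rho < min(b1, b2) / max(b1, b2):
        raise ValueError("low-correlation case requires rho < b1/b2 (b1 <= b2)")
    rng = random.Random(seed)
    s = 1.0 - rho * rho
    rate1 = (b1 - rho * b2) / s       # reciprocal of the mean of T1
    rate2 = (b2 - rho * b1) / s       # reciprocal of the mean of T2
    lead = bvn_density(b1, b2, rho) / (rate1 * rate2)
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        t1 = rng.expovariate(rate1)
        t2 = rng.expovariate(rate2)
        q = (t1 * t1 - 2.0 * rho * t1 * t2 + t2 * t2) / s   # T' R^{-1} T
        w = lead * math.exp(-0.5 * q)
        total += w
        total_sq += w * w
    mean = total / n
    var_one = max(total_sq / n - mean * mean, 0.0)
    return mean, var_one / (n * mean * mean)
```

With $\rho = 0$ the answer factors into a product of univariate tails, which gives a convenient sanity check before trying correlated cases.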

Several definitions of the multivariate normal Mills' ratio are available. The first definition
is due to Savage [29] for the case of orthants:

$$ M_1(B; R) = \frac{P(X \in B)}{\phi_p(b; R)}, \tag{33} $$

for $X \sim N_p(0, R)$. Another definition is gotten by first transforming to the spherically symmetric
case with $Z$, $A$, and $\alpha$ replacing $X$, $B$, and $\beta$ respectively. For $r = |\alpha|$ let

$$ M_2(A; I) = \frac{P(Z \in A)}{\phi(r)}. \tag{34} $$

This definition applies to convex regions $A$, not just orthants. However, the two definitions do
not coincide when $B$ is an orthant. For $R \ne I$,

$$ M_2(A; I) = \frac{P(Z \in A)}{(2\pi)^{(p-1)/2}\,\phi_p(\alpha; I)} = \left(\frac{2\pi}{|2\pi R|}\right)^{1/2} \frac{P(X \in B)}{\phi_p(\beta; R)}, \tag{35} $$

so that the two definitions differ in two respects. First, in place of $\beta$, $M_1$ uses the vertex $b$;
for example, when $(b_1, b_2) = (1, 2)$ and $\rho = 0.95$, $\beta = (1.9, 2)$. This is an important
difference, because when the correlation is high, importance sampling centered at $b$ can be much
worse than that centered even at the origin (see [17]). Second, the new definition has the factor
$(2\pi/|2\pi R|)^{1/2}$; this is not an important difference, but it does mean that proper comparisons
of the two must first adjust for this factor.

For the multivariate normal, the following inequalities for $M_2$ generalize (15):

$$ M_2(A; I) \le \frac{1}{r}\, P[(T, U) \in A_0], \tag{36} $$

$$ \begin{aligned} M_2(A; I) &\ge \frac{1}{r}\, P[(T, U) \in A_0] - \frac{1}{r} \int_{A_0} \frac{t^2}{2}\, r e^{-rt} \phi(u)\,du\,dt \\ &\ge \frac{1}{r}\, P[(T, U) \in A_0] - \frac{1}{r} \int_0^\infty \frac{t^2}{2}\, r e^{-rt}\,dt \\ &= \frac{1}{r} \left[ P((T, U) \in A_0) - \frac{1}{r^2} \right], \end{aligned} \tag{37} $$

where $(T, U)$ is as in (32). When $A = L^{-1}B$, where $B$ is a quadrant, explicit expressions for
the bounds in (36) and the first line of (37) are available. Such inequalities are not available for
$M_1$. These inequalities are used in [17] to prove that the estimators in (31) and (32) have cv²
tending to zero as $|\alpha| \to \infty$.

Mills' ratios for elliptically contoured densities are defined analogously: the numerator is
$P(X \in B)$, while the denominator is either $\phi_p(b; R)$ or $\phi_p(\beta; R)$ for $M_1$ and $M_2$, respectively.
In [9], Fang and Xu give a detailed account of $M_1$. They show that if $X$ has an elliptically
contoured distribution given by (2), where $g$ is a non-increasing function, then the function
$-P(X \in B)$ is a Schur convex function; they use this fact, along with standard majorization
results, to provide inequalities for $M_1$. A detailed study of the analog of $M_2$ for other elliptically
contoured distributions has not yet been done.


4  Computation  of  Moments 


In his paper, Brillinger [4] noted that a moment generalizes the notion of a probability, since
the latter is the first moment of an indicator function, which is a building block of integrable
functions. Here, we use the term moment to denote the expected value, when it exists, of some
function of a random vector, that is, $E[g(X)] = E[g(X_1, \ldots, X_p)]$. Conventionally, (product)
moments are defined as $E[\prod_{i=1}^p X_i^{k_i}]$, where $k_i$ are non-negative integers. In this section, we
discuss three methods for computing moments for elliptically contoured distributions. The first
uses the characteristic function when it is available, the second uses the stochastic representation
(1) when the moments of $r$ are available, and the third uses several partial differential equations
that are given below. Throughout, let $X = \mu + r\Sigma^{1/2}U_p$, as in (1).

The first two methods, which are due to Li [23], are of course equivalent; computational
convenience dictates the choice of method. Let the $k$th moment (when it exists) of the vector
$X$ be given by the matrix $\Gamma_k(X)$, where

$$ \Gamma_k(X) = (\gamma_{rs}) = \begin{cases} E[X \otimes X' \otimes X \otimes \cdots \otimes X \otimes X'] & \text{if } k \text{ is even} \\ E[X \otimes X' \otimes X \otimes \cdots \otimes X' \otimes X] & \text{if } k \text{ is odd,} \end{cases} \tag{38} $$

where $\otimes$ denotes the Kronecker product, which has $k$ terms in (38). This definition reduces to the
usual mean vector and covariance matrix when $k = 1$ and 2, respectively; $\Gamma_1(X) = \mu$ whenever
the first moment exists. For $k \ge 3$, the following recipe tells us where to find $E[\prod_{i=1}^p X_i^{k_i}]$ (with
$\sum_{i=1}^p k_i = k$) in $\Gamma_k(X)$: if the terms in the product are strung out thus, $\gamma_{rs} = E(X_{i_1} X_{i_2} \cdots X_{i_k})$,
then the row index is

$$ r = 1 + \sum_{j=1}^{[(k+1)/2]} (i_{2j-1} - 1)\, p^{[(k+1)/2] - j} \tag{39} $$

and the column index is

$$ s = 1 + \sum_{j=1}^{[k/2]} (i_{2j} - 1)\, p^{[k/2] - j}, \tag{40} $$

where $[a]$ is the greatest integer in $a$.
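
The index recipe can be checked mechanically on a fixed vector, since the expectation in the moment matrix is taken entry by entry. The sketch below assumes the alternating pattern of column and row vectors in the Kronecker product and that odd-numbered factors determine the row and even-numbered factors the column; matrices are plain lists of rows and all function names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def moment_array(x, k):
    """x (x) x' (x) x (x) ... with k factors, for a fixed vector x."""
    col = [[v] for v in x]            # x as a p-by-1 matrix
    row = [list(x)]                   # x' as a 1-by-p matrix
    out = col
    for t in range(2, k + 1):
        out = kron(out, row if t % 2 == 0 else col)
    return out


def position(indices, p):
    """1-based (row, column) holding x[i1] * x[i2] * ... * x[ik],
    where indices = (i1, ..., ik) are 1-based component labels."""
    odd = indices[0::2]               # i1, i3, ... select the row
    even = indices[1::2]              # i2, i4, ... select the column
    r = 1 + sum((i - 1) * p ** (len(odd) - j) for j, i in enumerate(odd, start=1))
    s = 1 + sum((i - 1) * p ** (len(even) - j) for j, i in enumerate(even, start=1))
    return r, s
```

For $p = 3$ and $k = 3$ the array is $9 \times 3$; looking up `position((2, 3, 1), 3)` in `moment_array(x, 3)` returns the product of the second, third, and first components of `x`.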

Using this notation, the matrices $\Gamma_k(X)$ can be expressed in two ways. First, if the
characteristic function is known, repeated differentiation of it gives the following expressions for $k = 2$
and 3:

$$ \Gamma_2(X) = \mu\mu' - 2\psi'(0)\Sigma, $$

$$ \Gamma_3(X) = \mu \otimes \mu' \otimes \mu - 2\psi'(0)\left[\mu \otimes \Sigma + \Sigma \otimes \mu + \mathrm{vec}(\Sigma)\mu'\right], \tag{41} $$

where $\mathrm{vec}(\Sigma) = (\sigma_{11}, \sigma_{21}, \ldots, \sigma_{p1}, \ldots, \sigma_{1p}, \ldots, \sigma_{pp})'$ strings out the columns of $\Sigma$ into one long
vector.

This formulation is useful for the family (5), for the characteristic function is given by

(42)

so that $-2\psi'(0) = n(2k + p)/2p$. A proof of this result is given in Iyengar and Tong [15]. When
the characteristic function is not available, but the moments of $r$ are available, the representation
(for $\mu = 0$ and $\Sigma = I$) $X = rU_p$ implies that $\Gamma_k(X) = E(r^k)\,\Gamma_k(U_p)$. Since $\Gamma_k(U_p)$ can be derived
from the known properties of the normal distribution, $-2\psi'(0)$ is replaced by $E(r^2)/p$ in (41).
For instance, for the multivariate $t$, the characteristic function is intractable, but the density of
$r$ is proportional to

$$ r^{p-1}(1 + r^2/\nu)^{-(p+\nu)/2}, \quad r > 0, \tag{43} $$

which yields the finite moments upon integration. Expressions for the fourth moment $\Gamma_4$ that
involve $\psi''(0)$ or $E(r^4)$ are given in [23]; even higher order moments can be computed along
the lines outlined there. Since quadratic forms in elliptically contoured distributions arise in
standard testing procedures (see Anderson and Fang [2]), Li also provides expressions for their
moments.
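
As a small worked illustration of the second method for the multivariate $t$ (a sketch added for clarity; the $F$-distribution step is a standard fact and is not taken from [23]): under the radial density (43), $r^2/p$ has an $F$ distribution with $(p, \nu)$ degrees of freedom, so for $\nu > 2$ and $\mu = 0$ the second moment follows directly.

```latex
E(r^2) = p \, E\!\left(F_{p,\nu}\right) = \frac{p\,\nu}{\nu - 2}, \qquad \nu > 2,
\qquad\Longrightarrow\qquad
\Gamma_2(X) = \frac{E(r^2)}{p}\,\Sigma = \frac{\nu}{\nu - 2}\,\Sigma .
```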

In a related study, Xu and Fang [36] say that an $n \times p$ random matrix $X$ has a matrix elliptically contoured
density if $TX$ has the same distribution as $X$ for every $n \times n$ orthogonal matrix $T$. The density
then has the form $c_{n,p} f(x'x)$; if $Y = X\Sigma^{1/2}$ for a $p \times p$ covariance matrix $\Sigma$, the density of $Y$ is
given by

$$ c_{n,p}\, |\Sigma|^{-n/2} f(\Sigma^{-1/2}\, y'y\, \Sigma^{-1/2}). \tag{44} $$

In their paper, Xu and Fang give the expected values of zonal polynomials and other symmetric
functions of $W = Y'Y$. The expressions are rather involved, so we omit them.

The third method of computing moments has a longer history. In 1958, Price [27] proved
the following result. Let $N_p(\mu, \Sigma)$ denote a $p$-variate normal with mean $\mu$ and covariance matrix
$\Sigma = (\sigma_{ij})$. Suppose that $X = (X_1, \ldots, X_p)$ has a $N_p(\mu, \Sigma)$ distribution (written $X \sim N_p(\mu, \Sigma)$),
and let $g_1(X_1), \ldots, g_p(X_p)$ be differentiable functions of the components of $X$, each admitting a
Laplace transform; then

$$ \frac{\partial}{\partial \sigma_{ij}} E\left[\prod_{k=1}^p g_k(X_k)\right] = E\left[g_i'(X_i)\, g_j'(X_j) \prod_{k \ne i,j} g_k(X_k)\right] \quad \text{for } i \ne j. \tag{45} $$

Conversely, if this identity holds for arbitrary $g_1, \ldots, g_p$ (with both expectations above defined)
then $X$ has a multivariate normal distribution. Price and others used this theorem to facilitate
studies in signal processing. In particular, consider a zero-memory non-linear input-output
device with Gaussian input $X_i$ that yields output $g_i(X_i)$. The $p$th-order correlation coefficient
of the outputs is a quantity of interest which requires the computation of the expectation of
$\prod_{k=1}^p g_k(X_k)$. The differential equation of Price's theorem provides a useful computational tool for
such calculations. Consider the following trivial example: if $h(\rho) = E(X_1 X_2)$, where $\rho_{12} = \rho$
is the correlation between the standardized variates $X_1$ and $X_2$, then $h'(\rho) = 1$, and $h(\rho) = \rho$
follows.
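
A numerical check of Price's theorem for $p = 2$ can be sketched as follows. The theorem equates the derivative of $E[g_1(X_1) g_2(X_2)]$ with respect to the correlation to $E[g_1'(X_1) g_2'(X_2)]$ for standardized normal variates. The example assumes $g_1(x) = g_2(x) = x^3$, approximates expectations with a midpoint rule on a square grid, and differentiates by a central difference; these choices and the function names are illustrative only.

```python
import math


def bvn_pdf(x, y, rho):
    """Standard bivariate normal density with correlation rho."""
    s = 1.0 - rho * rho
    q = (x * x - 2.0 * rho * x * y + y * y) / s
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(s))


def expect(func, rho, half_width=8.0, step=0.1):
    """Midpoint-rule approximation to E[func(X1, X2)] on [-half_width, half_width]^2."""
    m = int(round(2.0 * half_width / step))
    total = 0.0
    for i in range(m):
        x = -half_width + (i + 0.5) * step
        for j in range(m):
            y = -half_width + (j + 0.5) * step
            total += func(x, y) * bvn_pdf(x, y, rho)
    return total * step * step


def price_check(rho, eps=1e-3):
    """Return (d/d rho of E[X1^3 X2^3], E[3 X1^2 * 3 X2^2]); the two should agree."""
    g = lambda x, y: x ** 3 * y ** 3
    dg = lambda x, y: 9.0 * x ** 2 * y ** 2
    lhs = (expect(g, rho + eps) - expect(g, rho - eps)) / (2.0 * eps)
    rhs = expect(dg, rho)
    return lhs, rhs
```

The same pattern applies to other smooth nonlinearities; only the two lambdas inside `price_check` need to change.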

Although Price's theorem is an elegant result, it has several limitations. In fact, Pawula
[25] (see also Papoulis [24]) noted that when $p = 2$, and the right hand side of (45) can be
evaluated explicitly, there is a single differential equation to solve. But for larger $p$, there are
$p(p-1)/2$ differential equations to solve simultaneously. Furthermore, Price's result applies only
to a product of functions of individual components. Pawula used a result of Plackett [26] to
overcome these limitations. In 1954, Plackett proved the following identity while investigating
a reduction formula for multivariate normal probabilities: if the density of a $N_p(\mu, \Sigma)$ variate is
$\phi_p(x - \mu; \Sigma)$, then

$$ \frac{\partial}{\partial \sigma_{ij}} \phi_p(x - \mu; \Sigma) = \frac{\partial^2}{\partial x_i \partial x_j} \phi_p(x - \mu; \Sigma) \quad \text{for } i \ne j. \tag{46} $$

For the case $i = j$, we have the diffusion equation

$$ \frac{\partial}{\partial \sigma_{ii}} \phi_p(x - \mu; \Sigma) = \frac{1}{2} \frac{\partial^2}{\partial x_i^2} \phi_p(x - \mu; \Sigma). \tag{47} $$

Pawula used Plackett's identity to extend Price's theorem thus: if $g(x_1, \ldots, x_p)$ is sufficiently
smooth and vanishes rapidly near infinity, then

$$ \frac{\partial}{\partial \sigma_{ij}} E[g(X_1, \ldots, X_p)] = E\left[\frac{\partial^2}{\partial x_i \partial x_j}\, g(X_1, \ldots, X_p)\right] \quad \text{for } i \ne j. \tag{48} $$

This extension allowed the study of more general functions, such as the "linear rectifier
correlator," $g(x_1, x_2) = |x_1 + x_2| - |x_1 - x_2|$.

Pawula then used the following method, also due to Plackett, to reduce the number of
differential equations to solve from $p(p-1)/2$ to one. For a given $\Sigma$ define a line between it and
the identity matrix $I$, $\Sigma_t = (1 - t)I + t\Sigma$ for $0 \le t \le 1$. The chain rule then gives

$$ \frac{d}{dt} \phi_p(x - \mu; \Sigma_t) = \sum_{i<j} \sigma_{ij} \frac{\partial^2}{\partial x_i \partial x_j} \phi_p(x - \mu; \Sigma_t), \tag{49} $$

so that

$$ \frac{d}{dt} E_t[g(X_1, \ldots, X_p)] = E_t\left[\sum_{i<j} \sigma_{ij} \frac{\partial^2}{\partial x_i \partial x_j}\, g(X_1, \ldots, X_p)\right], \tag{50} $$

where $E_t$ denotes the expectation with respect to $N(\mu, \Sigma_t)$. When the right hand side of (50) can
be evaluated, a single ordinary differential equation results. By solving it, Pawula showed how
to compute the moments of various functions of $X$, such as products of Hermite polynomials or
error functions. In some cases, higher order derivatives with respect to $t$ are needed: they are
just iterates of the partial differential operator on the right of (50).
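
A short worked example of the single-equation method, added here for illustration: take $p = 2$, $\mu = 0$, a covariance matrix with unit variances and correlation $\rho$, and $g(x_1, x_2) = x_1^2 x_2^2$. Along the path the correlation is $t\rho$, the mixed second derivative of $g$ is $4 x_1 x_2$, and the starting value at $t = 0$ is the independent case.

```latex
\frac{d}{dt} E_t\!\left[X_1^2 X_2^2\right]
  = \rho \, E_t\!\left[4 X_1 X_2\right] = 4 t \rho^2,
\qquad E_0\!\left[X_1^2 X_2^2\right] = 1,
\qquad\Longrightarrow\qquad
E_1\!\left[X_1^2 X_2^2\right] = 1 + \int_0^1 4 t \rho^2 \, dt = 1 + 2\rho^2 .
```

The result agrees with the familiar fourth-order product moment of a standardized bivariate normal.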

The search for bounds for certain probabilities and expectations has recently led to several
generalizations of Plackett's identity to elliptically contoured distributions. The first is a result
of Joag-dev, et al. [18] which only requires that $g$ in (2) be differentiable:

$$ \frac{\partial}{\partial \sigma_{ij}} f(x; \mu, \Sigma) = -\frac{\partial}{\partial x_j} \left[ \left( \sum_{k=1}^p \sigma^{ik} x_k \right) f(x; \mu, \Sigma) \right], \tag{51} $$

where $\sigma^{ik}$ is the $i,k$ element of $\Sigma^{-1}$. Another is due to Iyengar ([12], see also Iyengar and Tong
[15]), who proved the following identity for $f_{p,k}$:


d 

dpi: 


fp,k(x\ 


Tj  y,  k\  T(|  +  m ) 


d 2 

dxidxj 


fP,m(x;p,Z,Ti)- 


(52) 


This specializes to Plackett's identity when $k = 0$. Finally, Gordon [10] proved a definitive
version of Plackett's identity for elliptically contoured densities (the proof of which he traced
back to [8,18]). He showed that the following two statements about functions $g$ and $h$, each
mapping $\mathbb{R}_+$ into itself and vanishing at $\infty$, are equivalent:

$$ h(t) = \frac{1}{2} \int_t^\infty g(s)\,ds \tag{53} $$

and

$$ \frac{\partial}{\partial \sigma_{ij}}\, g_\Sigma(x) = \frac{\partial^2}{\partial x_i \partial x_j}\, h_\Sigma(x), \tag{54} $$

where $g_\Sigma(x) = |\Sigma|^{-1/2} g(x'\Sigma^{-1}x)$ and similarly for $h$. When $g$ is an exponential or an
appropriately chosen gamma density the identities of Plackett and Iyengar, (46) and (52), respectively,
follow. Next, for the $p$-variate $t$ with $\nu$ degrees of freedom, we have

$$ h(t) = \frac{\Gamma((p + \nu)/2)}{(\pi\nu)^{p/2}\, \Gamma(\nu/2)}\, \frac{\nu}{p + \nu - 2} \left(1 + \frac{t}{\nu}\right)^{-(p + \nu - 2)/2}. \tag{55} $$

These extensions of Plackett's identity have been used principally for theoretical
investigations, in particular, for studying the nature of the dependence among the components of $X$. A
systematic study of their use for the computation of moments of various functions (other than
the usual product moments given by $\Gamma_k$) has not yet been done. The mathematical basis for
Plackett's identity goes back to the 19th century work of Schläfli [31] on hyperspherical
simplices, and the later work of the geometer Coxeter [6]. For more on the geometrical aspects of
Plackett's identity and related issues, see Abrahamson [1], Iyengar [16], and Ruben [28].


5  Conclusion 

In  this  paper,  we  have  discussed  recent  developments  in  probability  and  moment  calculations 
for  elliptically  contoured  distributions.  These  developments  should  allow  the  use  of  models  other 
than  the  multivariate  normal  for  high  dimensional  data.  Clearly,  much  more  work  needs  to  be 
done.  For  instance,  since  Monte  Carlo  is  an  increasingly  popular  method  for  assessing  the 
performance  of  various  systems,  a  more  systematic  study  of  appropriate  sampling  functions  is 
needed.  Only  the  beginnings  of  such  a  study  are  given  here. 


6  Acknowledgment 

This  research  was  partly  supported  by  the  Office  of  Naval  Research  grant  N00014-89-J-1496. 
This paper was written while I was visiting the Statistics Department at Carnegie-Mellon
University. I thank the department for the use of their facilities.


References 

1.  Abrahamson,  I.  (1964)  Orthant  Probabilities  for  the  Quadrivariate  Normal  Distribution.  An¬ 
nals  of  Mathematical  Statistics.  35,  1685-1703. 

2.  Anderson,  T.W.  and  Fang,  K.T.,  eds.  (1990)  Statistical  Inference  in  Elliptically  Contoured 
and  Related  Distributions.  Allerton  Press.  New  York. 

3.  Andrews,  D.  (1973)  A  General  Method  for  the  Approximation  of  Tail  Probabilities.  Annals 
of  Statistics.  1,  367-372. 

4.  Brillinger,  D.  (1992)  Higher  Order  Moments  and  Spectra.  This  volume. 

5.  Cambanis,  S.,  Huang,  S.,  Simons,  G.  (1981)  On  the  Theory  of  Elliptically  Contoured  Dis¬ 
tributions.  J.  of  Multivariate  Analysis.  11,  368-385. 

6.  Coxeter,  H.  (1973)  Regular  Polytopes.  Dover.  New  York. 

7.  Cramer,  H.  (1946)  Methods  of  Mathematical  Statistics.  Princeton  U.  Press.  Princeton. 

8.  Das  Gupta,  S.,  Eaton,  M.,  Olkin,  I.,  Perlman,  M.,  Savage,  J-,  Sobel,  M.  (1972)  Inequalities 
on  the  Probability  Content  of  Convex  Regions  for  Elliptically  Contoured  Distributions.  Proceed¬ 
ings  of  the  Sixth  Berkeley  Symposium  on  Mathematics,  Statistics,  and  Probability.  2,  241-264. 

9.  Fang,  K.T.,  Xu,  J.L.  (1990)  The  Mills’  Ratio  of  Multivariate  Normal  Distributions  and 
Spherical  Distributions.  In  Statistical  Inference  in  Elliptically  Contoured  and  Related  Distribu¬ 
tions.  K.T.  Fang  and  T.W.  Anderson,  eds.  Allerton  Press.  New  York,  457-468. 

10.  Gordon,  Y.  (1987)  Elliptically  Contoured  Distributions.  Probability  Theory  and  Related 
Fields.  76,  429-438. 

11.  Gray,  H.  and  Wang,  S.  (1991)  A  General  Method  for  Approximating  Tail  Probabilities.  J. 
of  the  American  Statistical  Association.  86,  159-166. 

12.  Iyengar,  S.  (1984)  A  Geometric  Approach  to  Probability  Inequalities.  University  of  Pitts¬ 
burgh  Technical  Report. 

13.  Iyengar,  S.  (1986)  On  a  Lower  Bound  for  the  Multivariate  Normal  Mills’  Ratio.  Annals  of 
Probability.  14,  1399-1403. 

14.  Iyengar,  S.  ( 1988)  Evaluation  of  Normal  Probabilities  of  Symmetric  Regions.  SIAM  Jour- 


162 


rial  of  Scientific  and  Statistical  Computing.  9,  418-424. 

15.  Iyengar,  S.,  Tong,  Y.  (1990)  Convexity  of  Elliptically  Contoured  Distributions  with  Appli¬ 
cations.  Sankhya  A.  51,  13-29. 

16.  Iyengar,  S.  (1990)  Plackett’s  Identity,  its  Generalizations,  and  their  Uses.  To  appear  in  the 
Festschrift  for  Charles  Dunnett. 

17.  Iyengar,  S.  (1990)  The  Approximation  of  Tail  Probabilities.  U.  of  Pittsburgh  Technical 
Report. 

18.  Joag-Dev,  K.,  Perlman,  M.,  Pitt,  L.  (1983)  Association  of  Normal  Random  Variables  and 
Slepian’s  Inequality.  Annals  of  Probability.  11,  451-455. 

19.  Johnstone,  I.  (1986)  A  Program  for  Estimating  Uncertainties  in  Quantile  Estimates  Derived 
from  Empirical  Pearson  Fits.  Stanford  U.  Technical  Report. 

20.  Kotz,  S.  (1974)  Multivariate  Distributions  at  a  Cross-Road.  Statistical  Distributions  in 
Scientific  Work.  Vol.  1.  G.  P.  Patil,  S.  Kotz,  J.K.  Ord,  eds.  247-270. 

21.  Lavine,  M.  (1991)  Problems  in  Extrapolation  with  Space-Shuttle  O-Ring  Data.  J.  of  the 
American  Statistical  Association.  86,  919-921. 

22.  Lawless,  J.  (1982)  Statistical  Models  and  Methods  for  Lifetime  Data.  Wiley.  New  York. 

23.  Li,  G.  (1990)  Moments  of  a  Random  Vector  and  its  Quadratic  Forms.  In  Statistical  Infer¬ 
ence  in  Elliptically  Cor*oured  and  Related  Distributions.  K.T.  Fang  and  T.W.  Anderson,  eds. 
Allerton  Press.  New  York,  433-440. 

24.  Papoulis,  A.  (1965)  Probability,  Random  Variables,  and  Stochastic  Processes.  McGraw- 
Hill.  New  York. 

25.  Pawula,  R.  (1967)  A  Modified  Version  of  Price’s  Theorem.  IEEE  Transactions  on  Infor¬ 
mation  Theory.  IT  13,  285-288. 

26.  Plackett,  R.  (1954)  Reduction  Formula  for  Multivariate  Normal  Integrals.  Biometrika,  41, 
351-360. 


27.  Price,  R.  (1958)  A  Useful  Theorem  for  Non-linear  Devices  Having  Gaussian  Inputs.  IRE 
Transactions  on  Information  Theory.  IT  4,  69-72. 

28. Ruben, H. (1954) On the Moments of Order Statistics in Samples from Normal Populations. Biometrika. 41, 200-227.

29. Savage, I.R. (1962) Mills' Ratio for Multivariate Normal Distributions. J. of Research of the National Bureau of Standards. B 66, 93-96.

30. Scharf, L. (1991) Statistical Signal Processing. Addison-Wesley. New York.

31. Schläfli, L. (1858, 1860) On the Multiple Integral $\int^n dx\,dy \ldots dz$, whose Limits are $p_1 = a_1 x + b_1 y + \ldots + h_1 z > 0$, $p_2 > 0, \ldots, p_m > 0$, $x^2 + y^2 + \ldots + z^2 < 1$. Quarterly J. of Pure and Applied Mathematics. 2, 261-301; 3, 54-68; 3, 97-107.

32. Slepian, D. (1962) The One-Sided Barrier Problem for Gaussian Noise. Bell System Technical J. 41, 463-501.

33.  Thisted,  R.  (1988)  Elements  of  Statistical  Computing.  Chapman  and  Hall.  New  York. 

34. Tong, Y.L. (1990) The Multivariate Normal Distribution. Springer. New York.

35.  Wessell,  A.,  Hall,  E.,  and  Wise,  G.  (1988)  Some  Comments  on  Importance  Sampling. 
Proceedings  of  the  22nd  Conference  on  Information  Sciences  and  Systems. 

36. Xu, J.L., Fang, K.T. (1990) The Expected Values of Zonal Polynomials of Elliptically Contoured Distributions. In Statistical Inference in Elliptically Contoured and Related Distributions. K.T. Fang and T.W. Anderson, eds. Allerton Press. New York, 469-479.




Legends  for  the  two  Figures 


FIGURE 1: $\rho < b_1/b_2$; A is bounded by $L_1$ and $L_2$.

L, :  Z2  =  h(M~pb^'(zi  ~ ( b'R~H )1/2)’  for  Zl  -  (b'R~Hf/2 
L2  :  zi  =  (*i  "  (b'R-'b)1'2),  for  zx  >  (b'R-'b)1'2 

FIGURE 2: $\rho > b_1/b_2$; A is bounded by $L_1$ and $L_2$.

$L_1$: $z_2 = (\rho z_1 - b_1)/(1 - \rho^2)^{1/2}$, for $z_1 > b_2$

$L_2$: $z_1 = b_2$, for $z_2 < (\rho b_2 - b_1)/(1 - \rho^2)^{1/2}$

Abstract

In this paper we discuss the problem of discriminating among various non-linear time series models. While the method we propose is of a general nature, we consider a restricted class of models that share an identical AR(1)-equivalent correlation function structure; hence, an identical spectral density. Consequently, discriminating among them on the basis of second-order moments is impossible, both theoretically and practically. The approach taken here aims at discriminating among the models on the basis of higher-order moments, i.e. the higher-order cumulant structure. Specifically, we shall focus on the 3rd-order cumulant structure as our initial step beyond the conventional covariance structure.

Key Words: Time series, Linear, Non-linear, Gaussianity, Stationarity, Autoregressive, Exponential Models, PAR(1), ARE(1), EAR(1), TEAR(1), NEAR(1), Robertson's Fixed and Random Models, Correlation and Cumulant Structure.
1  Introduction


Statistical methods based on moment information have been used extensively. In terms of model identification, the time series literature has devoted considerable attention to the problem of identifying the orders p and q under the general linear framework of ARMA(p,q) modelling. Second-order correlation information (e.g. the acf and pacf) became a main tool in the process of selecting p and q. While second-order information is of paramount importance in the case where the roots of the AR and MA polynomials remain outside the unit circle, higher-order cumulant information becomes crucial in deciding on the locations of the zeros or poles of possibly non-invertible, non-causal and non-Gaussian ARMA models. Of course, there are many very useful statistical tools for solving the above-mentioned problems which are not based on moments: for example, the use of information-based criteria such as AIC, MAIC and BIC in selecting the orders of an ARMA model, the use of MLE in locating the roots of a mixed-phase ARMA process, etc. While these non-moment-based methods might be more efficient than moment methods, the moment methods are generally simpler, easier and intuitively appealing both in theory and computation. It is often the case that one needs the initial point supplied by such a method to start an efficient but complicated non-moment-based method.

The introduction of non-linear time series models in recent years (e.g. bilinear, threshold, random coefficient, etc.) amplified the importance of using higher-order cumulant information in discriminating among the various non-linear models. It was shown that different models are capable of producing an identical correlation function of the linear autoregressive type, thus giving rise to a class of models characterized as 2nd-order equivalent. Consequently, efforts have been diverted to the analysis of the higher-order cumulant structure with the hope of exploiting differences among the models in the higher-order correlation dependency structure. The basic idea underlying the search for information in the higher-order cumulant structure in order to distinguish two models may be stated as follows. Within the class of moment determination, moment sequences of two different stochastic processes cannot be identical. Specifically, given two stationary series $\{X_t\} \neq \{Y_t\}$, there exists $(u_1, u_2, \ldots, u_k)$ such that the $k$th-order moments or cumulants with lags $(u_1, u_2, \ldots, u_k)$ of $\{X_t\}$ and $\{Y_t\}$ are not equal, i.e.

$$C_X(u_1, u_2, \ldots, u_k) \neq C_Y(u_1, u_2, \ldots, u_k).$$

In practice, one hopes that the above is true for a small order $k$, and that the difference is large relative to a given sample size. Otherwise, the search for discriminatory power in the higher-order cumulant structure might turn out to be fruitless.

The problem of discrimination among non-linear time series models has been considered by many authors. Lawrance and Lewis [24] considered special 3rd-order structure of the form

$$\mathrm{Cor}(R_t^{(p)}, X_{t+h}), \qquad \mathrm{Cor}([R_t^{(p)}]^2, X_{t+h})$$

where $R_t^{(p)}$ are the linear autoregressive residuals of order $p$ for RCA and PAR models. Within the class of bilinear models Li [26] and Gabr [10] considered quantities of the form

Cor(Xf,Xf+k) 

C'or(.\f,Xl+k) 

respectively. Auestad and Tjostheim [4] considered the use of non-parametric methods aimed at the conditional mean and variance of various non-linear time series models. Anderson [1] approached this problem differently by observing differences in the sample paths generated by the exponential family. Using a fluctuating-type statistic he was able to discriminate among simulated traces for a reasonable number of observations. In his work the moments do not play a role in the proposed discrimination procedure, and as such it may provide an alternative in situations where moments up to the desired order do not exist. Tsay [37] offers a very general method for selecting a model depending on the type of characteristic one is interested in investigating.

We propose a new approach which relies on the conjecture that the information required for discrimination among the models is available in the higher-order moments or, equivalently, in the higher-order cumulant structure. Specifically, we shall concentrate our attention on the third-order cumulant structure given by

$$C(r,s) = E[(X_t - \mu_x)(X_{t+r} - \mu_x)(X_{t+s} - \mu_x)]. \qquad (1.1)$$

The family of exponential time series models will be the framework within which we shall show the parametric equality of the correlation (hence, the spectral density) functions, and the way in which the theoretical higher-order cumulant structure points to the differences among the models. We demonstrate the method for a restricted case where we consider a family of non-linear time series models with known marginal distributions and a common AR(1)-equivalent correlation structure. This family consists of marginally exponentially distributed time series models which include:

•  (i)  Product Autoregressive Model [ PAR(1) ]

•  (ii)  Exponential Autoregressive Model [ EAR(1) ]

•  (iii)  Transposed Exponential Autoregressive Model [ TEAR(1) ]

•  (iv)  Newer Exponential Autoregressive Model [ NEAR(1) ]

•  (v)  Robertson's Fixed Model

•  (vi)  Robertson's Random Model

In addition we shall consider the linear autoregressive model with exponential innovation process, which we shall call ARE(1). As opposed to the family mentioned above, the ARE(1) does not have a known marginal distribution; however, its moments can be computed. This model, though, shares the same correlation structure as the non-linear exponential family. The underlying objective is to discriminate among realizations produced by the models we consider. This task is impossible to accomplish on the basis of second-order information, since they have identical second-order structure. We should note that the models being considered by no means exhaust all such second-order equivalent ones.

The plan for the paper is as follows. We shall start with a brief review of the traditional approach to time series analysis, followed by a presentation of the family of exponential time series models in section 3. In that section we shall state the form taken by each model, show the type of sample traces they are capable of producing and develop their correlation functions. Then we give a brief review of higher-order cumulants in section 4. Subsequently, the results we obtained for the 3rd-order cumulant structure for the seven models under consideration are presented in section 5. General methodology is presented in section 6. The results of the simulation part are the topic of section 7. There we also briefly discuss the way in which the sample traces, correlation functions and 3rd-order cumulants were generated empirically. A brief conclusion is given in section 8.

2  Stationarity,  Linearity  and  Gaussianity 

Over the last 50 years statisticians have developed a large body of theory and methods aimed at the analysis of time series data. A comprehensive account of their work culminated in books such as Kendall and Stuart [17], Jenkins and Watts [16], Box and Jenkins [5], Hannan [12], Anderson, Brillinger [7], Chatfield [9], Koopmans [18], Priestley [30], Rosenblatt [32], and Brockwell and Davis [8], to name a few. The foundations of classical time series analysis, as described in the above references, were thought to be based on two underlying assumptions, stating that:

1. The time series is stationary to an order of at least two. The process is assumed to remain in equilibrium about a constant mean level, with the proportion of ordinates not exceeding any given level being about equal over any time interval spanned by the sample. In case the observed series does not exhibit such behavior, it is further assumed that weak stationarity can be achieved by applying an appropriate transformation, e.g. linear filtering.

2. The time series, viewed as a stochastic process $\{X_t, t \in T\}$, is an output from a linear filter whose input is the white noise process $\{Z_t\}$; hence, the observed sample realization can be represented as a linear function of past and present values of $\{Z_t\}$, i.e. a one-sided representation.

In recent years the validity of these twin assumptions, as reasonable approximations to sample trace realizations, has been questioned as data from a wider variety of sources became available. Coupled with advances in the field of non-linear dynamics (deterministic chaos theories), research in the field of non-stationary, non-linear and non-Gaussian time series methodology has been in progress. Subsequent efforts to bring the non-linear time series literature under one unified framework resulted in the publication of books like Priestley [29] and Tong [36]. The reader is also referred to Mohler [28] for a collection of papers on theory, computational methods and applications in the area of non-linear signal processing. Tong [36] discusses properties of the Gaussian stationary linear model (GSLM) which may possibly be violated:

•  (a)  Time series that exhibit strong asymmetric behavior cannot be expected to conform to the GSLM. Such models are characterized by symmetric joint density functions, and that rules out asymmetric sample realizations.

•  (b)  The GSLM does not give rise to clusters of outliers, e.g. sudden bursts of large magnitude at irregular time intervals. Observed time series in socio-economic phenomena do tend to exhibit groups of outliers.

•  (c)  Sample traces that demonstrate strong cycles cannot be modeled by the GSLM since the regression functions at lag $k$, i.e. $E[X_t \mid X_{t-k}]$, are all linear due to the assumed joint normality.

•  (d)  The Gaussian process $\{X_t\}$ is reversible, i.e. $(X_{t_1}, \ldots, X_{t_n})'$ has the same distribution as $(X_{t_n}, \ldots, X_{t_1})'$. Reversibility is violated in the presence of differences between the rate at which a sample path rises to its maxima and the rate at which it falls away from them. One simple way of investigating departures from reversibility is to plot the sample on a transparency and then turn it over. If the mirror image is similar to the original plot then the series may be assumed reversible, and irreversible otherwise.

One could also test formally for Gaussianity and linearity. Following Brillinger [6], who pointed out the potential of using the bispectral density function as the basis for classifying a process as linear (and possibly Gaussian) or non-linear, Subba Rao and Gabr [35] and Hinich [13] developed formal tests for linearity and Gaussianity. The tests are based on the constancy of the normalized bispectral density function under the assumption that $\{X_t\}$ has a linear representation. Tong [36] provides a comprehensive review of tests for linearity and normality. Priestley [29] considers the case where a stationary process does not fit into a linear representation and concludes that "a fortiori many types of non-stationary processes would also fall outside the domain of linear models." In summary, observed time series do not necessarily conform to models such as the GSLM. The degree to which a time series realization represents a trace generated by the GSLM has a direct bearing on the usefulness of estimating an ARMA(p,q) model. For purposes of prediction, forecasting and control one is better off taking advantage of the non-linear (hence, non-Gaussian) structure of the data during the modeling stage. If indeed the GSLM is deemed inappropriate, one has the choice among several families of non-linear models. We shall turn to some of these explicitly in section 3.
3  AR(1) Type Exponential Models (EAR)


The family of models we consider here is that of the exponential autoregressive models, which is composed of the EAR(1) and its generalization to the transposed exponential autoregressive model TEAR(1), and the newer exponential autoregressive NEAR(1) model. This type of time series model was proposed by Gaver and Lewis [11], Lawrance and Lewis [20, 21], Jacobs and Lewis [14], Lawrance [19] and further developed by Lawrance and Lewis [22, 23, 24]. We also consider Robertson's Fixed and Random models [31], and the Product Autoregressive PAR(1) model proposed by McKenzie [27], with all models being restricted to a first-order autoregressive structure.

In contrast with other non-linear time series models (e.g. bilinear and threshold), this class of models is an attempt to capture the behavior of, possibly observed, time series processes with explicit marginal exponential distributions. The family of EAR models is advocated as a way of relaxing the assumption of marginal Gaussianity which underlies the Gaussian linear stationary model. The reasons behind the choice of the exponential distribution as the marginal distribution are given in Gaver and Lewis [11] and Lawrance and Lewis [23]. The standard linear first-order autoregressive process, AR(1), with exponential input, ARE(1), will be used for comparison purposes in section 5. This model has correlation and spectral density functions identical to those of the models mentioned above; however, its marginal distribution is not known. Thus, it is not to be considered as an exponential model but rather as a linear AR(1) model with exponential input. The fact that it is linear enables us to distinguish it from any other non-linear model, with or without an identical correlation structure, based on the theoretical result stating that a process with a linear representation has a flat (constant) normalized bispectral density; for more details see Subba Rao and Gabr [35].
3.1  PAR(1) Model


A natural extension of the linear AR(1) model was proposed by McKenzie [27] and consists of an exponentiation of the linear model such that the additive form is transformed into a multiplicative form. Here we consider a special case of the gamma family of marginally distributed time series where the output series has an exponential marginal distribution of unit mean. Specifically,

$$X_t = X_{t-1}^{\alpha} V_t, \qquad (3.1)$$

where $\alpha \in (0, 1)$ and $V_t$ is given by a mixture of uniform (0, x) and exponential mean one random variables independent of each other.

This model differs from the others we consider in two aspects. First, the innovation process does not possess a known parametric density function and its higher-order cumulant structure is expressed in terms of the moments of $X_t$ only. Second, we note that (3.1) may be linearized by taking the log of both sides of the equation. As such it is classified as an intrinsically linear model, i.e. a non-linear model which can be linearized. It differs from the following models, which cannot be linearized due to their switching nature and are to be considered under the class of intrinsically non-linear models, i.e. non-linear models which cannot be linearized.
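
The log-linearization mentioned above can be written out in one line. This is only a sketch of the step, under the assumption that the PAR(1) recursion reads $X_t = X_{t-1}^{\alpha} V_t$ with a positive innovation written here as $V_t$; the symbols $Y_t$ and $W_t$ are introduced purely for illustration.

```latex
X_t = X_{t-1}^{\alpha} V_t
\;\Longrightarrow\;
\log X_t = \alpha \log X_{t-1} + \log V_t
\;\Longrightarrow\;
Y_t = \alpha Y_{t-1} + W_t ,
\qquad Y_t = \log X_t ,\quad W_t = \log V_t .
```

On the log scale the recursion is an ordinary linear AR(1) equation in $Y_t$, which is what the term "intrinsically linear" refers to.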

3.2  EAR(1) Model

In the following set-up we let $\{E_t\}$ be a sequence of i.i.d. exponential($\lambda$) random variables with a probability density function given by

$$f_E(e) = \begin{cases} \lambda \exp(-\lambda e) & e \ge 0,\ \lambda > 0 \\ 0 & \text{otherwise.} \end{cases}$$

We define an EAR(1) model as

$$X_t = \rho X_{t-1} + \varepsilon_t \qquad (3.3)$$

$$\phantom{X_t} = \begin{cases} \rho X_{t-1} & \text{with prob. } \rho \\ \rho X_{t-1} + E_t & \text{with prob. } (1 - \rho) \end{cases} \qquad (3.4)$$

$$\phantom{X_t} = \rho X_{t-1} + I_t E_t \qquad (3.5)$$

with $0 < \rho < 1$ and $\{I_t\}$ being an i.i.d. sequence defined by

$$I_t = \begin{cases} 0 & \text{with prob. } \rho \\ 1 & \text{with prob. } 1 - \rho. \end{cases}$$

Under this formulation $\{X_t\}$ is marginally distributed as an exponential random variable with parameter $\lambda$.

Gaver and Lewis [11] point out several characteristics of the EAR(1) model:

•  Setting $\rho = 0$ yields the special case where $\{X_t\}$ is a sequence of i.i.d. exponential random variables.

•  $\varepsilon_t$ is not a continuous random variable. This feature distinguishes (3.5) from the usual linear AR(1) equation with Gaussian or exponential input.

•  The representation (3.5) is one of a random linear combination of i.i.d. exponential sequences; thus, it can be easily simulated on a computer.

One problem the EAR(1) model has is called the 'zero defect' (see Lawrance and Lewis [22]) and relates to the sample paths it generates. Specifically, the model generates paths in which large values are followed by runs of decreasing values, with the runs having geometrically distributed lengths. The large values arise when $E_t$ is included (i.e. $I_t = 1$) while the falling values stem from the deterministic part of (3.5) (i.e. $I_t = 0$).
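
Because (3.5) is a random linear combination of i.i.d. exponentials, a path can be generated directly from the recursion. The following is a minimal Python sketch, not the authors' simulation code: the function names `simulate_ear1` and `lag_correlation`, the seed handling, and the choice to start the path from an exponential($\lambda$) draw are assumptions made for illustration.

```python
import random


def simulate_ear1(n, rho, lam=1.0, seed=0):
    """Simulate n values of X_t = rho * X_{t-1} + I_t * E_t (equation 3.5)."""
    rng = random.Random(seed)
    x = rng.expovariate(lam)  # start from the exponential(lam) marginal
    path = []
    for _ in range(n):
        include = rng.random() >= rho  # I_t = 1 with probability 1 - rho
        x = rho * x + (rng.expovariate(lam) if include else 0.0)
        path.append(x)
    return path


def lag_correlation(path, lag):
    """Sample autocorrelation of the path at the given non-negative lag."""
    n = len(path)
    mean = sum(path) / n
    var = sum((v - mean) ** 2 for v in path) / n
    cov = sum((path[t] - mean) * (path[t + lag] - mean)
              for t in range(n - lag)) / n
    return cov / var


if __name__ == "__main__":
    path = simulate_ear1(50_000, rho=0.5)
    # The sample mean should be close to 1/lam and the lag-1
    # correlation close to rho for a long path.
    print(sum(path) / len(path), lag_correlation(path, 1))
```

Plotting a few hundred values of such a path for a large $\rho$ should make the runs of decreasing values described above visible.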
3.3  TEAR(1) Model


A natural extension of the EAR(1) model is to interchange the roles of $X_{t-1}$ and $E_t$ in (3.5). This does not affect the exponential($\lambda$) marginal distribution of $X_t$. Upon replacing $\rho$ by $1 - \alpha$ we obtain the transposed exponential autoregressive TEAR(1) model

$$X_t = I_t X_{t-1} + (1 - \alpha) E_t \qquad (3.7)$$

$$\phantom{X_t} = \begin{cases} X_{t-1} + (1 - \alpha) E_t & \text{with prob. } \alpha \\ (1 - \alpha) E_t & \text{with prob. } 1 - \alpha \end{cases} \qquad (3.8)$$

where

$$I_t = \begin{cases} 0 & \text{with prob. } 1 - \alpha \\ 1 & \text{with prob. } \alpha. \end{cases}$$

Note that in this case the innovation process is a continuous random variable scaled by a constant $1 - \alpha$. The behavior of a simulated path, for a large $\alpha$, shows geometrically distributed runs of rising values (i.e. $I_t = 1$) followed by sharp declines when the selection $I_t = 0$ is made, the decline being due to the exclusion of the previous value $X_{t-1}$.

The TEAR(1) model is discussed by Lawrance and Lewis [22] as an extension of the EAR(1) model. However, TEAR(1) is also a special case of Arnold's [3] exponential model driven by past innovations. Specifically, define the random variables

$N_t = 1$ if and only if $U_t = 1$
$N_t = i$ if and only if $U_t = 0,\ U_{t-1} = 0,\ \ldots,\ U_{t-i+1} = 1$

where $U_t$ are i.i.d. Bernoulli(p) random variables, with $N_t$ being distributed identically but not independently as Geometric($\alpha$) random variables with domain 1, 2, 3, ....

The model, expressed in terms of past innovations, is given by

$$X_t = (1 - \alpha) \sum_{i=1}^{N_t} E_{t-i+1} \qquad (3.10)$$

where $E_t \sim$ iid Exp($\lambda$) and the sum is multiplied by $1 - \alpha$ to obtain strict stationarity. This representation is obtained if one expresses the TEAR(1) model (3.8) recursively.
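
The TEAR(1) recursion can be simulated in the same way as the EAR(1) model. The sketch below is illustrative only and assumes the recursion $X_t = I_t X_{t-1} + (1 - \alpha) E_t$ with $I_t = 1$ with probability $\alpha$; the names `simulate_tear1` and `lag_correlation` are hypothetical helpers, not part of the paper.

```python
import random


def simulate_tear1(n, alpha, lam=1.0, seed=0):
    """Simulate n values of X_t = I_t * X_{t-1} + (1 - alpha) * E_t."""
    rng = random.Random(seed)
    x = rng.expovariate(lam)  # start from the exponential(lam) marginal
    path = []
    for _ in range(n):
        keep_previous = rng.random() < alpha  # I_t = 1 with probability alpha
        carried = x if keep_previous else 0.0
        x = carried + (1.0 - alpha) * rng.expovariate(lam)
        path.append(x)
    return path


def lag_correlation(path, lag):
    """Sample autocorrelation of the path at the given non-negative lag."""
    n = len(path)
    mean = sum(path) / n
    var = sum((v - mean) ** 2 for v in path) / n
    cov = sum((path[t] - mean) * (path[t + lag] - mean)
              for t in range(n - lag)) / n
    return cov / var


if __name__ == "__main__":
    path = simulate_tear1(50_000, alpha=0.75)
    # For a long path the sample mean should be near 1/lam and the
    # lag-1 correlation near alpha.
    print(sum(path) / len(path), lag_correlation(path, 1))
```

Whenever $I_t = 1$ the new value exceeds the previous one, which is the mechanism behind the runs of rising values noted in the text.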




3.4  NEAR(1) Model


The previous two models, EAR(1) and TEAR(1), are special cases of a more flexible model in which $\{X_{t-1}\}$ in (3.8) is scaled by a coefficient $\beta$. Simulated realizations generated by such a model are of interest, as it may circumvent the problem of geometrically distributed runs of falling or increasing values, which might not be applicable. Specifically, let $\{X_t\}$ denote the time series variables and $\{E_t\}$ be a sequence of i.i.d. unit mean exponential random variables acting as the innovation process. The NEAR(1) model is defined as

$$X_t = \varepsilon_t + \begin{cases} \beta X_{t-1} & \text{with prob. } \alpha \\ 0 & \text{with prob. } 1 - \alpha \end{cases} \qquad (3.11)$$

$$\phantom{X_t} = \beta I_t X_{t-1} + \varepsilon_t \qquad (3.12)$$

where

$$\varepsilon_t = \begin{cases} E_t & \text{with prob. } p \\ b E_t & \text{with prob. } 1 - p \end{cases}$$

$$I_t = \begin{cases} 0 & \text{with prob. } 1 - \alpha \\ 1 & \text{with prob. } \alpha \end{cases} \qquad (3.14)$$

with $b = (1 - \alpha)\beta$ and $p = \frac{1 - \beta}{1 - (1 - \alpha)\beta}$. The parameters $\alpha$ and $\beta$ are allowed to take values over the domain defined by $0 < \alpha, \beta \le 1$ with $\alpha\beta \neq 1$. Setting ($\alpha = 1$, $0 < \beta < 1$) in (3.12) yields the EAR(1) model, whereas fixing ($\beta = 1$, $0 < \alpha < 1$) gives rise to the TEAR(1) model. Both are extreme cases of a NEAR(1) process. We note that due to the distributional assumption underlying $\{E_t\}$, the innovation process is not allowed to take on negative values, i.e. $P(E_t < 0) = 0$. It is obvious how the concept of "switching" comes into play in (3.12). The switch from one linear piece to the other is controlled by an external random mechanism with a prespecified parametric probabilistic structure.
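
The two-level switching in the NEAR(1) model (one switch for carrying the previous value, one for the innovation) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code. It assumes the innovation mixture uses $b = (1 - \alpha)\beta$ and $p = (1 - \beta)/(1 - (1 - \alpha)\beta)$; the function names are hypothetical.

```python
import random


def near1_innovation_params(alpha, beta):
    """Return (p, b) for the innovation mixture, assuming
    b = (1 - alpha) * beta and p = (1 - beta) / (1 - b)."""
    if alpha == 1.0 and beta == 1.0:
        raise ValueError("alpha and beta must not both equal 1")
    b = (1.0 - alpha) * beta
    p = (1.0 - beta) / (1.0 - b)
    return p, b


def simulate_near1(n, alpha, beta, seed=0):
    """Simulate n values of X_t = beta * I_t * X_{t-1} + eps_t with
    unit-mean exponential E_t."""
    p, b = near1_innovation_params(alpha, beta)
    rng = random.Random(seed)
    x = rng.expovariate(1.0)  # start from the unit exponential marginal
    path = []
    for _ in range(n):
        e = rng.expovariate(1.0)
        eps = e if rng.random() < p else b * e            # innovation switch
        carried = beta * x if rng.random() < alpha else 0.0  # I_t switch
        x = carried + eps
        path.append(x)
    return path


if __name__ == "__main__":
    # alpha = 1 gives b = 0 (EAR-type innovation); beta = 1 gives p = 0
    # (TEAR-type innovation).
    print(near1_innovation_params(1.0, 0.5))
    print(near1_innovation_params(0.5, 1.0))
    path = simulate_near1(50_000, alpha=0.8, beta=0.625)
    print(sum(path) / len(path))
```

With `alpha=0.8` and `beta=0.625` the product $\alpha\beta$ equals 0.5, so under the stated assumptions a long path should show a lag-1 sample correlation near 0.5 and a sample mean near one.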
3.5  Robertson's Fixed and Random models

Robertson [31] suggested two exponential models which we shall refer to as Robertson's fixed and random models. Our main concern is to show that these models cannot be identified via the correlation or spectral density functions; hence, one has to explore the higher-order cumulant structure.


3.5.1  The Fixed Model


Consider the following switching structure

$$X_t = \begin{cases} X_{t-1} - \ln\beta & \text{with prob. } \beta \\ E_t & \text{with prob. } 1 - \beta \end{cases} \qquad (3.15)$$

where $\beta$ is a fixed constant and $E_t$ has a truncated exponential density

$$f_E(e) = \begin{cases} \dfrac{\exp(-e)}{1 - \beta} & 0 < e < -\ln\beta \\ 0 & \text{otherwise} \end{cases} \qquad (3.16)$$

with the marginal distribution of $X_t$ being exponential with unit mean. Alternatively, (3.15) may be represented using an indicator random variable, i.e.

$$X_t = I_t (X_{t-1} - \ln\beta) + (1 - I_t) E_t \qquad (3.17)$$

where

$$I_t = \begin{cases} 1 & \text{with prob. } \beta \\ 0 & \text{with prob. } 1 - \beta. \end{cases} \qquad (3.18)$$
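
A path from the fixed model can be generated by inverse-CDF sampling of the truncated exponential. The sketch below is a hedged illustration rather than Robertson's or the authors' implementation: it assumes $E_t$ is a unit exponential truncated to $(0, -\ln\beta)$, i.e. with density $\exp(-e)/(1 - \beta)$ on that interval, and the function name `simulate_robertson_fixed` is hypothetical.

```python
import math
import random


def simulate_robertson_fixed(n, beta, seed=0):
    """Simulate n values of X_t = I_t * (X_{t-1} - ln(beta)) + (1 - I_t) * E_t,
    where I_t = 1 with probability beta and E_t is assumed to be a unit
    exponential truncated to (0, -ln(beta))."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    rng = random.Random(seed)
    shift = -math.log(beta)
    x = rng.expovariate(1.0)  # start from the unit exponential marginal
    path = []
    for _ in range(n):
        if rng.random() < beta:
            x = x + shift  # carry the previous value, shifted up by -ln(beta)
        else:
            u = rng.random()
            # Inverse CDF of the truncated exponential:
            # F(e) = (1 - exp(-e)) / (1 - beta) on [0, -ln(beta)).
            x = -math.log(1.0 - u * (1.0 - beta))
        path.append(x)
    return path


if __name__ == "__main__":
    path = simulate_robertson_fixed(50_000, beta=0.5)
    print(sum(path) / len(path))
```

Under the stated assumption the marginal of the simulated path is unit exponential, so the sample mean of a long path should be close to one.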


3.5.2  The Random model


One may generalize the fixed model by allowing $\beta$ to become a random variable which acts as a mixing distribution, with domain restricted to the interval [0, 1]. Specifically, let $X_t$ have the representation

$$X_t = \begin{cases} X_{t-1} - \ln\beta_t & \text{with prob. } \beta_t \\ E_t & \text{with prob. } 1 - \beta_t \end{cases} \qquad (3.19)$$

or, stated in terms of an indicator random variable,

$$X_t = I_t (X_{t-1} - \ln\beta_t) + (1 - I_t) E_t$$

where

$$I_t = \begin{cases} 1 & \text{with prob. } \beta_t \\ 0 & \text{with prob. } 1 - \beta_t. \end{cases} \qquad (3.21)$$

The probability density assigned to $\beta_t$ is a beta density with parameters $(\alpha, 2)$

$$f_{\beta}(\beta) = \begin{cases} \alpha(\alpha + 1)(1 - \beta)\beta^{\alpha - 1} & 0 < \beta < 1,\ \alpha > 0 \\ 0 & \text{otherwise.} \end{cases}$$

The distribution of $\ln\beta_t$ is obtained using the standard transformation of variables technique. Let $Y = \ln\beta_t$; then

$$f_Y(y) = \begin{cases} \alpha(\alpha + 1)(1 - e^{y}) e^{\alpha y} & -\infty < y < 0,\ \alpha > 0 \\ 0 & \text{otherwise.} \end{cases}$$

The probability density function for $E_t$ is appropriately modified

$$f_E(e) = \begin{cases} \dfrac{\exp(-e)}{1 - \beta_t} & 0 < e < -\ln\beta_t \\ 0 & \text{otherwise.} \end{cases} \qquad (3.24)$$

Within this framework one notices that the random variables $I_t$ and $E_t$ are not independent, as they both involve the mixing distribution of $\beta_t$. The marginal distribution of $X_t$, though, remains exponential with unit mean by construction. We remark that all these models are strictly stationary and hence, since their second moments are finite, also stationary in the wide sense.


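3.6  Summary

We recall that the models under investigation are the following:

•  ARE(1):

$$X_t = \phi X_{t-1} + E_t$$

•  PAR(1):

$$X_t = X_{t-1}^{\alpha} V_t \qquad (3.26)$$

•  EAR(1):

$$X_t = \begin{cases} \rho X_{t-1} & \text{with prob. } \rho \\ \rho X_{t-1} + E_t & \text{with prob. } 1 - \rho \end{cases}$$

•  TEAR(1): ($\rho = 1 - \alpha$)

$$X_t = \begin{cases} X_{t-1} + (1 - \alpha) E_t & \text{with prob. } \alpha \\ (1 - \alpha) E_t & \text{with prob. } 1 - \alpha \end{cases}$$

•  NEAR(1):

$$X_t = \begin{cases} \beta X_{t-1} + \varepsilon_t & \text{with prob. } \alpha \\ \varepsilon_t & \text{with prob. } 1 - \alpha \end{cases}$$

where

$$\varepsilon_t = \begin{cases} E_t & \text{with prob. } p \\ b E_t & \text{with prob. } 1 - p \end{cases} \qquad b = (1 - \alpha)\beta$$

•  Robertson's Fixed Model:

$$X_t = \begin{cases} X_{t-1} - \ln\beta & \text{with prob. } \beta \\ E_t & \text{with prob. } 1 - \beta \end{cases}$$

where

$$f_E(e) = \begin{cases} \dfrac{\exp(-e)}{1 - \beta} & 0 < e < -\ln\beta \\ 0 & \text{otherwise} \end{cases}$$

•  Robertson's Random Model:

$$X_t = \begin{cases} X_{t-1} - \ln\beta_t & \text{with prob. } \beta_t \\ E_t & \text{with prob. } 1 - \beta_t \end{cases} \qquad (3.31)$$

where

$$f_E(e) = \begin{cases} \dfrac{\exp(-e)}{1 - \beta_t} & 0 < e < -\ln\beta_t \\ 0 & \text{otherwise} \end{cases}$$

$$f_{\beta}(\beta) = \begin{cases} \alpha(\alpha + 1)(1 - \beta)\beta^{\alpha - 1} & 0 < \beta < 1,\ \alpha > 0 \\ 0 & \text{otherwise.} \end{cases}$$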
Table 1: Correlation Functions

Model                   Correlation function $\rho(s)$
ARE(1)                  $\phi^s$
PAR(1)                  $\alpha^s$
EAR(1)                  $\rho^s$
TEAR(1)                 $\alpha^s$
NEAR(1)                 $(\alpha\beta)^s$
Robertson's Fixed       $\beta^s$
Robertson's Random      $\left(\frac{\alpha}{\alpha + 2}\right)^s$


For all models but Robertson's and PAR(1), the input process $\{E_t\}$ is assumed to be an i.i.d. exponential sequence of unit mean and, with the exception of ARE(1), the output $\{X_t\}$ has a marginal exponential distribution with mean one. The correlation functions for the various models are given in Table 1.

Figures  Id  contain  simulated  traces  produced  by  the  various  models.  Note  that  we  indexed 
the parameter values of each model such that the correlation functions produce identical results, i.e. $\rho(s) = (0.1)^s, (0.5)^s, (0.75)^s$.
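
The indexing of parameter values can be made explicit with a small helper that, for a target lag-one correlation, returns one parameter setting per model. This is a sketch under stated assumptions, not the authors' procedure. It assumes the correlation functions are $\phi^s$ for ARE(1), $\alpha^s$ for PAR(1) and TEAR(1), $\rho^s$ for EAR(1), $(\alpha\beta)^s$ for NEAR(1), $\beta^s$ for Robertson's fixed model and $(\alpha/(\alpha+2))^s$ for Robertson's random model. The function name `matched_parameters` and the symmetric choice $\alpha = \beta$ for NEAR(1) are illustrative.

```python
import math


def matched_parameters(rho):
    """Return, for each model, parameter values giving correlation rho**s
    under the correlation functions assumed in the lead-in."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    root = math.sqrt(rho)
    return {
        "ARE(1)": {"phi": rho},
        "PAR(1)": {"alpha": rho},
        "EAR(1)": {"rho": rho},
        "TEAR(1)": {"alpha": rho},
        # Any pair with alpha * beta = rho would do; the symmetric
        # choice is only one possibility.
        "NEAR(1)": {"alpha": root, "beta": root},
        "Robertson fixed": {"beta": rho},
        # Solve alpha / (alpha + 2) = rho for alpha.
        "Robertson random": {"alpha": 2.0 * rho / (1.0 - rho)},
    }


if __name__ == "__main__":
    for target in (0.1, 0.5, 0.75):
        print(target, matched_parameters(target))
```

For NEAR(1) the constraint $\alpha\beta = \rho$ leaves one degree of freedom, so different $(\alpha, \beta)$ pairs with the same product give the same correlation function but possibly different higher-order structure.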

4  Higher  Order  Cumulants 

Let $\{X_t\}$ be a real valued strictly stationary random process and let $m(t_1, t_2, \ldots, t_k)$ be the $k$th-order product moment, i.e.

$$m(t_1, t_2, \ldots, t_k) = E[X_{t_1} X_{t_2} \cdots X_{t_k}]. \qquad (4.1)$$

For a stationary process of order $k$ we can write (4.1) as

$$m(t_1, t_2, \ldots, t_k) = m(0, t_2 - t_1, \ldots, t_k - t_1). \qquad (4.2)$$

Now let the characteristic function (cf) of $\{X_t\}$ be defined by

$$\phi_X(\zeta_1, \zeta_2, \ldots, \zeta_k) = E[e^{i(\zeta_1 X_{t_1} + \zeta_2 X_{t_2} + \cdots + \zeta_k X_{t_k})}] \qquad (4.3)$$

then the Taylor series expansion of (4.3) about the origin is given by

®.v  (0  =  /  | E  ^ ( ,V„  -v„  ■  ■  ■  )  +  0( I  C  l‘ )|  dF  (4.1) 

=  Y.  "  ^)  E(.V„X„  . . .  A,,]  +  0(K  I*)  H-5) 

J=  1  J' 

=  . 4)  +  0(K|k)  (4.6) 

j=  1  J' 

where $|\zeta| = \{\sum_{i=1}^{k} \zeta_i^2\}^{1/2}$ and $F = F_{X_{t_1}, X_{t_2}, \ldots, X_{t_k}}(x_{t_1}, \ldots, x_{t_k})$ is the joint cumulative distribution function.

The logarithm of the cf (4.3) is defined as the cumulant generating function (cgf)

$$K_X(\zeta_1, \zeta_2, \ldots, \zeta_k) = \log\{E[e^{i(\zeta_1 X_{t_1} + \zeta_2 X_{t_2} + \cdots + \zeta_k X_{t_k})}]\} \qquad (4.7)$$

such that $C(t_1, t_2, \ldots, t_k)$, the $k$th-order joint cumulant of the set of random variables $\{X_{t_1}, X_{t_2}, \ldots, X_{t_k}\}$, is the coefficient of $(\zeta_1 \zeta_2 \cdots \zeta_k)$ in the Taylor series expansion of (4.7) about the origin. Specifically

F.v(C)  =  E  +  0(1  C  \k)  (4.8) 

j= 1  J- 

where  Oj  ( /  j .  t2 . i, )  =  C’umulant( Xtl ,  A’(2 , . . . ,  Xt} )  .  We  note  that  the  cumulant  of  order 

greater  than  two  are  all  zero  for  a  Gaussian  process.  This  feature  is  used  extensively  in  signal 
processing  to  suppress  Gaussian  noise. 

The relationship between moments and cumulants was formalized by Leonov and Shiryaev
[25] and is given by

$$m(t_1, \ldots, t_k) = E[X_{t_1} X_{t_2} \cdots X_{t_k}] = \sum_{\nu} C(\nu_1) C(\nu_2) \cdots C(\nu_p) \qquad (4.9)$$

where the sum is taken over all partitions $\nu = (\nu_1, \ldots, \nu_p)$ of $(t_1, \ldots, t_k)$.
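
The partition sum in (4.9) can be made concrete with a short sketch. The Python below is an illustration only, not the authors' code: `set_partitions`, `moment_from_cumulants` and `cumulant_from_moments` are hypothetical helper names, and the inverse relation uses the standard weights $(-1)^{p-1}(p-1)!$ for a partition with $p$ blocks. The example uses $k$ copies of a single unit-mean exponential variable, whose $r$th moment is $r!$ and whose $r$th cumulant is $(r-1)!$.

```python
from math import factorial, prod


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or open a new block that contains only `first`.
        yield [[first]] + partition


def moment_from_cumulants(indices, cumulant):
    """Relation (4.9): sum over partitions of the product of block cumulants."""
    return sum(
        prod(cumulant(tuple(block)) for block in partition)
        for partition in set_partitions(list(indices))
    )


def cumulant_from_moments(indices, moment):
    """Inverse relation: weight (-1)^(p-1) (p-1)! for a partition with p blocks."""
    total = 0
    for partition in set_partitions(list(indices)):
        p = len(partition)
        weight = (-1) ** (p - 1) * factorial(p - 1)
        total += weight * prod(moment(tuple(block)) for block in partition)
    return total


# Illustration (assumption): every index refers to the same unit-mean
# exponential variable, so a block of size r has moment r! and cumulant (r-1)!.
def exp_moment(block):
    return factorial(len(block))


def exp_cumulant(block):
    return factorial(len(block) - 1)


if __name__ == "__main__":
    print(moment_from_cumulants((0, 1, 2), exp_cumulant))
    print(cumulant_from_moments((0, 1, 2), exp_moment))
```

The recursion enumerates all set partitions, so the cost grows with the Bell numbers; that is acceptable for the low orders (3 and 4) used in this paper.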

Relationship (4.9) implies that we can write the moments in terms of the cumulants, and if
we invert (4.9) then one can write the cumulants in terms of the corresponding moments;
hence, the inversion of (4.9) yields

$$C(X_{t_1}, X_{t_2}, \ldots, X_{t_k}) = \sum_{\nu} (-1)^{p-1} (p-1)! \; E\Big[\prod_{j \in \nu_1} X_j\Big] \cdots E\Big[\prod_{j \in \nu_p} X_j\Big] \qquad (4.10)$$


and if the process is $k$th-order stationary then we may write

$$C(t_1, t_2, \ldots, t_k) = C(0, t_2 - t_1, \ldots, t_k - t_1) = C(\tau_1, \tau_2, \ldots, \tau_{k-1}).$$

From (4.10) it is seen that the cumulant $C(\tau_1, \tau_2, \ldots, \tau_{k-1})$ is a $k$th-order polynomial in the
moments of order no higher than $k$ and conversely, the $k$th-order moment $m(t_1, t_2, \ldots, t_k)$ is a
$k$th-order polynomial in cumulants of order no higher than $k$. Consider the specific cases:


$$C_1(0) = E[X_t] = \mu_x$$

$$C_2(s) = \mu(s) - \mu_x^2$$

$$C_3(s_1, s_2) = \mu(s_1, s_2) - \{\mu(s_1) + \mu(s_2) + \mu(s_2 - s_1)\}\mu_x + 2\mu_x^3$$

$$C_4(s_1, s_2, s_3) = \mu(s_1, s_2, s_3) - \mu_x\{\mu(s_2 - s_1, s_3 - s_1) + \mu(s_2, s_3) + \mu(s_1, s_2) + \mu(s_1, s_3)\}$$
$$\qquad + 2\mu_x^2\{\mu(s_1) + \mu(s_2) + \mu(s_3) + \mu(s_2 - s_1) + \mu(s_3 - s_1) + \mu(s_3 - s_2)\}$$
$$\qquad - \mu(s_1)\mu(s_3 - s_2) - \mu(s_2)\mu(s_3 - s_1) - \mu(s_3)\mu(s_2 - s_1) - 6\mu_x^4$$

where

$$\mu(s) = E[X_t X_{t+s}]$$

$$\mu(s_1, s_2) = E[X_t X_{t+s_1} X_{t+s_2}]$$

$$\mu(s_1, s_2, s_3) = E[X_t X_{t+s_1} X_{t+s_2} X_{t+s_3}]$$
Consequently, one may write $C_3(s_1, s_2)$ in the form

$$C(r, s) = E[(X_t - \mu_x)(X_{t+r} - \mu_x)(X_{t+s} - \mu_x)] \qquad (4.11)$$
$$\qquad = E[X_t X_{t+r} X_{t+s}] - \mu_x\{E[X_t X_{t+r}] + E[X_t X_{t+s}] + E[X_{t+r} X_{t+s}]\} + 2\mu_x^3$$

and $C_4(s_1, s_2, s_3)$ may be expressed as

$$C(r, s, u) = E[X_t X_{t+r} X_{t+s} X_{t+u}] \qquad (4.12)$$
$$\qquad - \mu_x\big(E[X_{t+r} X_{t+s} X_{t+u}] + E[X_t X_{t+s} X_{t+u}] + E[X_t X_{t+r} X_{t+s}] + E[X_t X_{t+r} X_{t+u}]\big)$$
$$\qquad + 2\mu_x^2\big(E[X_t X_{t+r}] + E[X_t X_{t+s}] + E[X_t X_{t+u}] + E[X_t X_{t+s-r}] + E[X_t X_{t+u-r}] + E[X_t X_{t+u-s}]\big)$$
$$\qquad - E[X_t X_{t+r}] E[X_t X_{t+u-s}] - E[X_t X_{t+s}] E[X_t X_{t+u-r}] - E[X_t X_{t+u}] E[X_t X_{t+s-r}] - 6\mu_x^4.$$
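
A small sketch may help in checking (4.11) and (4.12) term by term. The functions below are illustrative (the names `c3`, `c4` and `iid_exp_moment` are not from the paper); they take the mean and the product-moment functions as arguments and evaluate the two expressions literally. The check uses an i.i.d. unit-mean exponential sequence, for which all cumulants at distinct lags vanish and the zero-lag values are the marginal cumulants 2 and 6.

```python
from collections import Counter
from math import factorial


def c3(mu, m2, m3, r, s):
    """Third-order cumulant C(r, s) as in (4.11).

    mu: process mean; m2(s) = E[X_t X_{t+s}]; m3(r, s) = E[X_t X_{t+r} X_{t+s}].
    """
    return m3(r, s) - mu * (m2(r) + m2(s) + m2(s - r)) + 2 * mu ** 3


def c4(mu, m2, m3, m4, r, s, u):
    """Fourth-order cumulant C(r, s, u) as in (4.12)."""
    return (
        m4(r, s, u)
        - mu * (m3(s - r, u - r) + m3(s, u) + m3(r, s) + m3(r, u))
        + 2 * mu ** 2 * (m2(r) + m2(s) + m2(u) + m2(s - r) + m2(u - r) + m2(u - s))
        - m2(r) * m2(u - s) - m2(s) * m2(u - r) - m2(u) * m2(s - r)
        - 6 * mu ** 4
    )


def iid_exp_moment(*lags):
    """E[X_t X_{t+l1} ...] for an i.i.d. unit exponential sequence (assumption).

    Independence across distinct times gives a product of marginal moments,
    and the n-th marginal moment of a unit exponential is n!.
    """
    counts = Counter((0,) + tuple(lags))
    result = 1
    for multiplicity in counts.values():
        result *= factorial(multiplicity)
    return result


if __name__ == "__main__":
    mu = 1
    print(c3(mu, iid_exp_moment, iid_exp_moment, 0, 0))
    print(c3(mu, iid_exp_moment, iid_exp_moment, 1, 2))
    print(c4(mu, iid_exp_moment, iid_exp_moment, iid_exp_moment, 0, 0, 0))
```

Passing the moments as functions keeps the algebra of (4.11)-(4.12) separate from the question of how the moments are obtained (theoretically or from data).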

For a detailed account of the relations between moments and cumulants the reader is advised
to consult Kendall, Stuart and Ord [16]. Cumulants and their relationship to spectral analysis
are discussed by Sesay [24] and Rosenblatt [32]. Sesay [34] discusses the various uses of
cumulants and cumulant spectra, specifically

•  Cumulant spectra are used in tests aimed at discriminating between linear and non-linear
non-Gaussian processes (see Subba Rao and Gabr [35]).

•  The asymptotic distributions in some non-linear theory may be obtained using cumulants.

•  Time reversibility may be determined by verifying $C(-s_1, \ldots, -s_{k-1}) = C(s_1, \ldots, s_{k-1})$
or, equivalently, that the imaginary part of the $k$th-order spectrum is equal to zero.

•  Cross-cumulants, and cross-cumulant spectra, can be used in the estimation of the
parameters of a non-linear difference equation through the use of transfer functions
that arise in the Volterra expansion (see Priestley [30]).
5  The 3rd-Order Cumulant Structure


In the following we present the 3rd-order cumulant structure for each of the models
discussed in section 3. For each of the models a closed form solution to the 3rd-order cumulant
structure is given. These solutions are based on closed form expressions obtained for the
expectation terms which define the 3rd-order cumulant structure. For all but the ARE(1) model
the output process is marginally distributed as an exponential process with unit mean. The
results presented in this section are based on the marginal moments given by

$$\mu_x = 1, \qquad \mu_{x,2} = 2, \qquad \mu_{x,3} = 6.$$

The input process is taken as an i.i.d. exponential process with unit mean; hence, with identical
moments as stated above. Robertson's and PAR(1) models form a separate class in this
respect, since the innovation process is defined by a sequence of i.i.d. truncated exponential
random variables and a mixture of exponential and uniform random variables, respectively.
The introduction of a mixing distribution in Robertson's random model further complicates
the structure of the innovation process. Tables 2, 3 and 4 list the 3rd-order cumulant structure
for those models. We recall that the models under investigation are given by (3.25)-(3.31).

The following expressions are used in tables 2, 3, 4, 5 and 6

I 

I  —  0 

■> 

(1  —  0l )(  1  —  0) 

6 

(T  -o;i)fl  -o2)(l  -6) 

1  -  cv 
I  —  o 


1  -  oiT 

1  -  o 


V  r 

fir.  2 

th-.A 
d

l(r) 

o 

h  [ln.il,] 
r:[(in.-iri;] 
t-Vhh] 


-  A’[.V"]  =  l'(l  +  o)  ,  o  e  (0.  I) 
=  />  +  (!-  I>)l> 

=  An  +  ( i  -  />)/>2] 


i  -  .v 


1  - 

:i 

1  - 

(o.iV 

1  - 

-  n.i 

1  - 

y 

1  - 

T 

=  n.i- 

=  /•/[ In 

Ah\- 

=  E[ln 

At;  1 

=  /'-'[(/ 

nAft 

=  H[X 

oVlj 

1  - 

(f 

1  - 

0 

o 

n  + 

■) 

-  /•;[/» 

M;\  = 

=  m 

n.i)2  If 

-  m 

A;]  = 

;i;\  = 

!  +  a 


(1  +a) 


1  •  £[*?/«] 


(n  +  2)2  (o  +  I  )2 

r  '>  ■> 


(o  +  l)'  (o  +  2)5 


o  +  I  o+2  (o  +  2)2 


o  +1  o  +  2  (o  +  2)2  (a  +  2):i 


Given the information summarized in these tables one may standardize the rate of decay
of the correlation function such that the correlation functions are identical for these models
for a given parameter value. Our goal is to investigate how the 3rd-order cumulant
structure behaves subject to a standardized correlation function. It is our conjecture that
one might be able to discriminate among signal paths produced by the various models on
the basis of higher order moments. It is obvious that the correlation functions cannot be
Table 2: 3rd-Order Cumulant Structure: The Linear Model


ARE(l) 

C(0,0) 

$\mu_{x,3} - 3\mu_{x,2}\mu_x + 2\mu_x^3$

C(0, T ) 

4*  [/^x,3  2//Xf2/^x] 

+[7(r)  -  fix][fiXt 2  -  2 fi2x\ 

C(r,  r) 

<t>iTl*x, 3  +  2^,2 [7 (2r )  -  7(r)] 

+2/ir[72(r)  -  /xx7(r)] 

+  -  V-'liT)}} 

~ ^x.i^xfl  +  2  4>T]  +  2^1 

C(  1,1  +  r ) 

0rK2/*x, 3  +[#x, 2  T  /i*][2 (pT  +  7(7-)] 

-Mx[Mx,  2{<Pt  +  «‘>r+1  +  <t> }  +Mx{7(t)  +  7(t  +  0  +  1}] 

+2/C 

C  (A,A  +  r) 

f+2Vx,3 

+2<?!)r[pr,2{7(2/?)  -  7(/i)}  +/i*{72(^)  +  {72(A)  -  ^"^(A)}}] 

+Lh,2(t>h/y{T)  +/zr7(/i)7(r) 

-/Tr[/ix,2{^r  +  <t>TJrh  +  4>h}  +Vx{l  (t)  +  7(r  +  A)  +  7(A)}]  +  2/i® 

Table 3: 3rd-Order Cumulant Structure: The Intrinsically Linear Model


PAR(l) 

2 

C(0, T ) 

*‘2  ar  [Ax,nrr-f-2  2Hx,a 

r+l] 

C(r,  r) 

,2ar  +  l  .■)Itx,nr-fl 

M2.2r,r  “  Mi.a’' 

C(l,l  +  r) 

Or  -+  1  )+  1  ^  T,C»r  +  1  J 

/Tjr,r>r  Mx,a(ar  +  1)  { 

jniHramjBiiillifl 

C’(ft,  ft  -f  t) 

fleBHraraiEmiil 

used as a tool for discrimination purposes, and consequently neither can the spectral densities.

To illustrate the shape of the 3rd-order cumulant structure (see figure 4) we set
$\phi$, $\rho$, $\alpha$, $\beta$ and $\alpha\beta$ equal to 0.5. First, we observe that certain ratios in tables 2, 3, 4 and 5 yield a
clear characterization of the cumulant surfaces. Consider the ratios, presented in table 6, for
the models with a simple closed form, i.e. EAR(1), TEAR(1) and Robertson's fixed model.

While such simple expressions are not available for the remaining models it is possible to
investigate the behavior of these ratios numerically. Two of the above ratios turn out to be
Table 4: 3rd-Order Cumulant Structure: The Intrinsically Non-linear Models


EAR(l) 

TEAR(l) 

NEAR(l) 

C(0,0) 

2 

2 

2 

2  PT 

2aT 

2(^r 

C(r.r) 

2  Qr[l  +r(l  —  «)] 

6AT 

+4(a/?)r[p,7 p(r)  -  1] 

+2//f  [//£a/3  j  2a0  {  7a(  T ) 
-(o/9)t-17/3(t)}  -  7 0o{t)\ 

+M£,27a(t) 

('(M  +  r) 

2  pT+2 

2oT+1(2  -  o) 

2(a/?)T+l[3/?  +  2/rt] 

+(a/3)T/tz,,  2 
+7ap(^)/b[/6  +  2a/?] 

-[2{(a/?)r(l  +  a/?)  +  a/?} 

+  M7atf(r) 

+  7  afi{T  +  1  )  +  1  }]  +  2 

C(hJl  +  T) 

2 pT+2h 

2aT+fl[\  +h{  1  -  a)] 

6(al1)T\h 

+4(a,tf)T+V<7/?(M 

+2(al3)hfj,t^ais(T ) 

+2(Q^)T+V?,_la/j{7A(A) 

+{a'3)Tnu  27a(A) 

-[2{(a/?)T(l  +  (ottf)h)  +  (a/?)/l} 
+/uf{7  apM  +7  opir  +  h) 
+7<vp(^)}]  +  2 

more informative for the purpose of discriminating among the models: $C(\tau,\tau)/C(0,\tau)$ and $C(1,1+\tau)/C(\tau,\tau)$.

In the simulation context, however, since the cumulant surfaces decay rapidly towards 0, the
computation of these ratios becomes difficult as we attempt to divide by very small values.
These numerical considerations destabilize the use of the ratios as a tool for discriminating
among the models. The computed ratios (as functions of the lag $\tau$), indexed by a set
of parameter values such that the correlation function of each model exhibits identical
behavior (e.g. $\rho(s) = (0.5)^s$), are also given in figures 5-6, so as to demonstrate the shapes of the
expressions given in the first and fourth rows of table 6.

Given the plots of the ratios and the cumulant surfaces for the six models we may classify
them into three categories. EAR(1) forms its own class. Robertson's models and TEAR(1)
Table 5: 3rd-Order Cumulant Structure: The Intrinsically Non-linear Models (cont.)


Robertson’s  Fixed  Model 

Robertson’s  Random  Model 

C(0,0) 

2 

•> 

C(O.r) 

2;V 

2£>r 

C(r,r) 

2dr(l  -  Tlnii) 

2t>r[l  -  2t 

+  («  +  2)e//>,.[7(r  -  1)  -  (r  -  l){d_l] 

+  7(r)[2e/  +  e-] 

nu  +  t) 

2Ar+l(  1  -  In  ft) 

ier  +  i 

-£/[2(2/>,  +  1 )  -  rj 
+«h(^)  +  i(t  +  i)  +  i] 

-<i<h(T)  -  2(e-  i) 

t  '(h.h  +  r ) 

2JT+,1(  1  -  hlnj) 

—2gT+l'~  1  [2libr  +  «] 

-2(}h[<n(T  -  1)  +  1] 

+£>r[(o  +  2)<tbr{~f(h  —  1 )  —  ( /*  —  1  }  -  2] 

+7(//)(u27(r)  +  cffr] 

+«[7( T )  +  -)(t  +  li)  +  7(/0]  +  2 

form a separate group. NEAR(1) and PAR(1) form an additional class. Note that the
cumulant surface produced by NEAR(1) is a combination of EAR(1) and TEAR(1) and that
it looks very much like the surface produced by PAR(1). However, the two models seem
to differ in their behavior when one observes the plots of the theoretical ratios. A closer look
at the vertical axis for NEAR(1) and PAR(1) in figures 5 and 6 shows that the ranges are
similar and much smaller than the ranges of the vertical axis for the other models.

6  Methodology 

In the following we propose a discrimination procedure that may be applied to the models
under investigation (3.25)-(3.31) or to any set of competing models.

Let

$$M = \{\text{a finite set of finite parameter models}\}.$$
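Table 6: Ratios of the 3rd-Order Cumulant Structure

Ratio                             EAR(1)               TEAR(1)                                                  Robertson's Fixed Model

$C(\tau,\tau)/C(0,\tau)$          $\rho^{\tau}$        $1 + (1-\alpha)\tau$                                     $1 - \tau\ln\beta$

$C(1,1+\tau)/C(0,\tau)$           $\rho^{2}$           $\alpha(2-\alpha)$                                       $\beta(1 - \ln\beta)$

$C(h,h+\tau)/C(0,\tau)$           $\rho^{2h}$          $\alpha^{h}[1 + h(1-\alpha)]$                            $\beta^{h}(1 - h\ln\beta)$

$C(1,1+\tau)/C(\tau,\tau)$        $\rho^{2-\tau}$      $\alpha(2-\alpha)/[1 + \tau(1-\alpha)]$                  $\beta(1 - \ln\beta)/(1 - \tau\ln\beta)$

$C(h,h+\tau)/C(\tau,\tau)$        $\rho^{2h-\tau}$     $\alpha^{h}[1 + h(1-\alpha)]/[1 + \tau(1-\alpha)]$       $\beta^{h}(1 - h\ln\beta)/(1 - \tau\ln\beta)$

$C(h,h+\tau)/C(1,1+\tau)$         $\rho^{2(h-1)}$      $\alpha^{h-1}[1 + h(1-\alpha)]/(2-\alpha)$               $\beta^{h-1}(1 - h\ln\beta)/(1 - \ln\beta)$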
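
The ratios in rows 1 and 4 of Table 6 can be evaluated numerically from the closed-form surfaces. The sketch below is illustrative only: the surface functions encode $C(h, h+\tau)$ for EAR(1), TEAR(1) and Robertson's fixed model as we read them from Tables 4 and 5 (an assumption, given the state of the scan), and the function names are hypothetical.

```python
import math


def surface_ear(rho):
    """C(h, h + tau) for EAR(1), as read from Table 4 (assumption)."""
    return lambda h, tau: 2.0 * rho ** (tau + 2 * h)


def surface_tear(alpha):
    """C(h, h + tau) for TEAR(1), as read from Table 4 (assumption)."""
    return lambda h, tau: 2.0 * alpha ** (tau + h) * (1.0 + h * (1.0 - alpha))


def surface_robertson_fixed(beta):
    """C(h, h + tau) for Robertson's fixed model, as read from Table 5 (assumption)."""
    return lambda h, tau: 2.0 * beta ** (tau + h) * (1.0 - h * math.log(beta))


def ratios(surface, tau):
    """Return the row-1 and row-4 ratios of Table 6 at lag tau."""
    c_0_tau = surface(0, tau)      # C(0, tau)
    c_tau_tau = surface(tau, 0)    # C(tau, tau)
    c_1_tau = surface(1, tau)      # C(1, 1 + tau)
    return c_tau_tau / c_0_tau, c_1_tau / c_tau_tau


if __name__ == "__main__":
    models = {
        "EAR(1)": surface_ear(0.5),
        "TEAR(1)": surface_tear(0.5),
        "Robertson's Fixed": surface_robertson_fixed(0.5),
    }
    for name, surface in models.items():
        for tau in range(1, 6):
            row1, row4 = ratios(surface, tau)
            print(f"{name:18s} tau={tau}  row1={row1:.4f}  row4={row4:.4f}")
```

With all three parameters set to 0.5 the correlation functions coincide, yet the two ratios already separate the models at small lags. At larger lags both numerator and denominator become very small, which is the numerical difficulty noted in the text.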

Our objective is to identify the model $m \in M$ most compatible with $\{X_t\}_{t=1}^{n}$. Specifically,
given $\{X_t\}_{t=1}^{n}$, find a model $m \in M$ such that $m \sim \{X_t\}_{t=1}^{n}$.

Procedure:

1.  Compute $C_X(u_1, \ldots, u_k)$, $k = 0, 1, 2, \ldots$, $u_i \in I$ integer. We call it the empirical
$k$th-order cumulant structure based on the data $\{X_t\}_{t=1}^{n}$.

2.  For each $m \in M$

(a)  Estimate, using $\{X_t\}_{t=1}^{n}$, the parameter $\theta_m$ (possibly a vector) for model $m$.

(b)  Compute $C_{\theta_m}(u_1, \ldots, u_k)$ for model $m$ empirically or using the theoretical cumulant
structure. We shall call it Method 1 if the computation of the cumulant
structure is done using the known theoretical cumulant structure. We shall call
it Method 2 if the computation of the cumulant structure is done empirically
based on $\{X_t\}$.

3.  Given the above quantities we seek to minimize, for a norm $\|\cdot\|$,

$$\min_{m \in M} \| C_X(u_1, \ldots, u_k) - C_{\theta_m}(u_1, \ldots, u_k) \|. \qquad (6.1)$$

Alternatively,

$$\min_{m \in M} \| f_X(\lambda_1, \ldots, \lambda_k) - f_{\theta_m}(\lambda_1, \ldots, \lambda_k) \| \qquad (6.2)$$

where $f_{\theta_m}(\lambda_1, \ldots, \lambda_k)$ is the $k$th-order spectrum (i.e. polyspectrum). The general
distance measure may be specified as, e.g.,

$$\| S \| = \sum_{(i,j) \in S} |s_{ij}|^2.$$
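
The selection step (6.1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' FORTRAN implementation: the candidate surfaces are the closed forms for EAR(1) and TEAR(1) as read from Table 4, the norm is a plain sum of squared differences over a lag grid, and the "observed" surface is a deterministic perturbation of the TEAR(1) surface rather than an estimate from data. All function names are hypothetical.

```python
import numpy as np


def lag_grid(max_lag):
    """Integer grids h, tau = 0..max_lag used to tabulate C(h, h + tau)."""
    lags = np.arange(max_lag + 1)
    return np.meshgrid(lags, lags, indexing="ij")


def ear_surface(rho, max_lag):
    h, tau = lag_grid(max_lag)
    return 2.0 * rho ** (tau + 2 * h)


def tear_surface(alpha, max_lag):
    h, tau = lag_grid(max_lag)
    return 2.0 * alpha ** (tau + h) * (1.0 + h * (1.0 - alpha))


def distance(observed, model):
    """One possible norm for (6.1): sum of squared differences over the grid."""
    return float(np.sum((np.asarray(observed) - np.asarray(model)) ** 2))


def discriminate(observed, candidates):
    """Return the name of the closest candidate and all distances."""
    distances = {name: distance(observed, surf) for name, surf in candidates.items()}
    best = min(distances, key=distances.get)
    return best, distances


if __name__ == "__main__":
    max_lag = 4
    candidates = {
        "EAR(1)": ear_surface(0.5, max_lag),
        "TEAR(1)": tear_surface(0.5, max_lag),
    }
    # Stand-in for an empirical surface: TEAR(1) plus a small constant offset.
    observed = tear_surface(0.5, max_lag) + 0.01
    best, distances = discriminate(observed, candidates)
    print(best, distances)
```

In an actual application the parameter of each candidate would first be estimated from the data (step 2a), and `observed` would be the empirical cumulant surface from step 1.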

There are several issues that need to be considered under the proposed procedure. First,
various properties of the model such as stationarity, ergodicity, moment conditions, moment
calculations, parameter estimation and simulation aspects of sample traces must be investigated.
Second, statistical properties of the formal test statistics based on (6.1) or (6.2) have
to be studied. In order to do so the sampling properties of the proposed procedure must be
investigated. In the following section we consider the simulation aspects of (6.1) and present
some simulation results for both methods 1 and 2.

7  Simulation  Results 

In order to verify the possibility of discriminating among the various models on the basis of
their respective 3rd-order cumulant surfaces, it is necessary to obtain reasonable agreement
between the theoretical and simulated cumulants. In the following we discuss issues related
to the simulation aspects of the sample traces, correlation functions, 3rd-order cumulant
surfaces and ratios.

7.1  Simulating  Sample  Traces 

The simulation aspects of the NEAR(1) model and its special cases, EAR(1) and TEAR(1),
were considered by Lawrance and Lewis [20]. The algorithm they give is used in our
simulation to generate sample realizations for the NEAR(1) family. The subcases, EAR(1)
and TEAR(1), are simulated by setting ($\alpha = 0.99$, $0 < \beta < 1$) and ($\beta = 0.99$, $0 < \alpha < 1$)
respectively, in the same program that generates the simulated paths for the NEAR(1) model.
We follow Lawrance and Lewis in setting the degenerate parameters to 0.99 so as to avoid
complications in the simulation of the traces.
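
A rough sketch of such a simulator is given below. It is not the authors' FORTRAN/IMSL program, and the model equations (3.25)-(3.31) are not reproduced in this section, so the recursion is an assumption: the NEAR(1) form commonly attributed to Lawrance and Lewis, $X_n = \varepsilon_n + \beta X_{n-1}$ with probability $\alpha$ and $X_n = \varepsilon_n$ otherwise, with an innovation $\varepsilon_n$ that is a two-component exponential mixture chosen so that the marginal is unit exponential and the lag-$s$ correlation is $(\alpha\beta)^s$. NumPy's generator replaces the IMSL routines, and the function name is hypothetical.

```python
import numpy as np


def simulate_near1(n, alpha, beta, seed=0, burn_in=500):
    """Simulate n values of a NEAR(1)-type process with unit exponential marginal.

    Assumed recursion: X_n = eps_n + beta * X_{n-1} with probability alpha,
    X_n = eps_n otherwise, where eps_n = E_n with probability
    (1 - beta) / (1 - b) and eps_n = b * E_n otherwise, b = (1 - alpha) * beta.
    """
    rng = np.random.default_rng(seed)
    total = n + burn_in
    b = (1.0 - alpha) * beta
    p_full = (1.0 - beta) / (1.0 - b)

    e = rng.exponential(1.0, size=total)
    scale = np.where(rng.random(total) < p_full, 1.0, b)
    eps = scale * e                          # mixture innovation
    carry = rng.random(total) < alpha        # keep beta * X_{n-1} with prob. alpha

    x = np.empty(total)
    prev = rng.exponential(1.0)              # start from the stationary marginal
    for i in range(total):
        prev = eps[i] + (beta * prev if carry[i] else 0.0)
        x[i] = prev
    return x[burn_in:]


def lag1_correlation(x):
    """Sample lag-1 autocorrelation."""
    centered = x - x.mean()
    return float(np.sum(centered[:-1] * centered[1:]) / np.sum(centered ** 2))


if __name__ == "__main__":
    # EAR(1)-like case: alpha = 0.99 with correlation parameter beta = 0.5.
    trace = simulate_near1(200_000, alpha=0.99, beta=0.5, seed=1)
    print(trace.mean(), lag1_correlation(trace))
```

Setting `alpha=0.99` approximates the EAR(1) subcase and `beta=0.99` the TEAR(1) subcase, mirroring the choice described in the text.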

Robertson's fixed and random models are generated by two different programs: one which
allows the selection of a branch with a fixed probability and one which allows the selection
of a branch with a random probability generated according to a beta random variable with
parameters $(\alpha, 2)$. The input signal is a truncated exponential; hence, it needs to be simulated
accordingly. Since no IMSL subroutine is available we generate a realization from a
truncated exponential random variable using the cumulative distribution function technique.
Realizations from the AR(1) model are easily simulated and no further explanations are required.
McKenzie [27] discusses the simulation of PAR(1) models. The innovation process
$V_t$ is generated according to

$$V = E^{1-\alpha} \, b(U)$$

where $U$ is distributed as a uniform $(0, \pi)$ sequence of random variables which is independent
of $E$, a sequence of exponential mean one random variables. The function $b$ is defined by

$$b(\phi) = \sin\phi \, (\sin\alpha\phi)^{-\alpha} \, (\sin(1-\alpha)\phi)^{-(1-\alpha)}.$$

Thus, $\{V\}$ is generated as a mixture of uniform and exponential sequences of independent
random variables.
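
Both generators can be sketched in a few lines. The code is illustrative and rests on assumptions that should be checked against the original model definitions: the truncation interval of Robertson's input is not stated in this section, so the sketch truncates a rate-one exponential to $(0, c)$ for a user-chosen $c$; and the PAR(1) innovation uses $V = E^{1-\alpha} b(U)$ with $b(\phi) = \sin\phi\,(\sin\alpha\phi)^{-\alpha}(\sin(1-\alpha)\phi)^{-(1-\alpha)}$, for which $E[V] = 1/\Gamma(1+\alpha)$. The function names are hypothetical and Python's `random` module replaces IMSL.

```python
import math
import random


def truncated_exponential(u, upper, rate=1.0):
    """Inverse-CDF draw from an exponential(rate) truncated to (0, upper).

    u is a uniform(0, 1) variate. The truncated CDF is
    F(x) = (1 - exp(-rate * x)) / (1 - exp(-rate * upper)), inverted below.
    """
    return -math.log(1.0 - u * (1.0 - math.exp(-rate * upper))) / rate


def b(phi, alpha):
    """Trigonometric factor of the PAR(1) innovation (assumed form)."""
    return (
        math.sin(phi)
        * math.sin(alpha * phi) ** (-alpha)
        * math.sin((1.0 - alpha) * phi) ** (-(1.0 - alpha))
    )


def par1_innovation(rng, alpha):
    """One draw of V = E^(1 - alpha) * b(U), U ~ uniform(0, pi), E ~ exponential(1)."""
    u = rng.uniform(0.0, math.pi)
    e = rng.expovariate(1.0)
    return e ** (1.0 - alpha) * b(u, alpha)


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [truncated_exponential(rng.random(), upper=2.0) for _ in range(5)]
    print(draws)
    v = [par1_innovation(rng, 0.5) for _ in range(100_000)]
    print(sum(v) / len(v), 1.0 / math.gamma(1.5))
```

The inverse-CDF step is the "cumulative distribution function technique" named in the text; the comparison of the sample mean of $V$ with $1/\Gamma(1+\alpha)$ is only a sanity check on the assumed form of $b$.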

All the simulated paths are generated by FORTRAN programs that call IMSL subroutines
which are used to simulate continuous uniform, beta and exponential realizations.

7.2  Simulating  Higher  Order  Moments 

One FORTRAN program is employed in simulating the correlation functions, 3rd-order cumulant
surfaces and certain slices of these surfaces. Smoothing considerations lead us to
simulate each model 30 times where the length of each simulated trace is 1010 data points.
Table 7: Distance Measure (6.1): $\rho(s) = (0.25)^s$

                      PAR(1)    EAR(1)    TEAR(1)    NEAR(1)    Robertson's Fixed    Robertson's Random
PAR(1)                0.26      0.21      0.17       0.06       0.58                 0.40
EAR(1)                0.018     0.015     0.68       0.08       1.40                 1.14
TEAR(1)               0.78      0.70      0.018      0.37       0.08                 0.03
NEAR(1)               0.12      0.09      0.31       0.008      0.84                 0.63
Robertson's Fixed     1.80      1.65      0.22       1.06       0.02                 0.07
Robertson's Random    0.73      0.66      0.02       0.34       0.12                 0.05

The program computes two expectation terms: $E[X_t X_{t+\tau}]$, over the range of lags 0 to 9, and
$E[X_t X_{t+\tau} X_{t+\tau+s}]$, over the range of lags -9 to 9. Then the smoothed empirical correlation
function and the smoothed 3rd-order cumulant surface are computed using their definitions.
In the computations of the expectation terms we use:

$$\hat{E}[X_t X_{t+\tau}] = \frac{1}{1010} \sum_{t=1}^{1001} X_t X_{t+\tau}$$

$$\hat{E}[X_t X_{t+\tau} X_{t+\tau+s}] = \frac{1}{1010} \sum_{t=1}^{1001} X_t X_{t+\tau} X_{t+\tau+s}$$

In order to determine how accurately the simulated cumulant surfaces match their theoretical
counterparts we plot the empirical correlation functions, the empirical $C(\tau, \tau)$ slice and the
complete simulated surfaces in figures 7-9. This is done for various parameter values and
shown for those that correspond to $\rho(s) = (0.5)^s$.
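
The estimation and smoothing steps can be sketched as below. This is an illustration under assumptions rather than the original FORTRAN program: it uses the $(r, s)$ parametrisation of (4.11) with non-negative lags, averages over a fixed number of terms `n_terms` (dividing by the number of terms rather than by the trace length), and smooths by averaging the surfaces of independent replications. Function names and the i.i.d. example input are hypothetical.

```python
import numpy as np


def lagged_product_mean(x, lags, n_terms):
    """Average of x[t] * x[t + lag_1] * ... over t = 0, ..., n_terms - 1."""
    t = np.arange(n_terms)
    product = x[t].astype(float)
    for lag in lags:
        product = product * x[t + lag]
    return float(product.mean())


def empirical_c3(x, r, s, n_terms):
    """Empirical third-order cumulant C(r, s) following (4.11), for r, s >= 0."""
    mu = float(x.mean())
    m2 = lambda lag: lagged_product_mean(x, (abs(lag),), n_terms)
    m3 = lagged_product_mean(x, (r, s), n_terms)
    return m3 - mu * (m2(r) + m2(s) + m2(s - r)) + 2.0 * mu ** 3


def smoothed_surface(traces, max_lag, n_terms):
    """Average the empirical C(r, s) surface over independent replications."""
    surface = np.zeros((max_lag + 1, max_lag + 1))
    for x in traces:
        for r in range(max_lag + 1):
            for s in range(max_lag + 1):
                surface[r, s] += empirical_c3(x, r, s, n_terms)
    return surface / len(traces)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 30 replications of length 1010, as in the text; i.i.d. input for illustration.
    traces = [rng.exponential(1.0, size=1010) for _ in range(30)]
    surface = smoothed_surface(traces, max_lag=9, n_terms=1001)
    print(surface[0, 0], surface[1, 2])
```

With lags up to 9 and 1001 terms, the largest index used is 1009, so each trace of 1010 points is just long enough.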

7.3  Discrimination  Procedure  :  Method  1 

The results of the simulation study are summarized in tables 7-12. Tables 7-9 are examples
of typical values obtained by a single run of the simulation. Tables 10-12 provide the proportions
of correct model identification out of 30 repetitions. Note that in table 7 the diagonal
line contains the minimum values of rows 2-5. This is precisely how we would expect the
procedure to perform for any parameter value indexing a standardized correlation function.
However, errors occur at the first and last rows where the method fails to select the correct
Table 8: Distance Measure (6.1): $\rho(s) = (0.5)^s$

                      PAR(1)    EAR(1)    TEAR(1)    NEAR(1)    Robertson's Fixed    Robertson's Random
PAR(1)                0.67      0.41      0.74       0.02       1.44                 1.43
EAR(1)                0.07      0.01      2.24       0.30       3.34                 3.42
TEAR(1)               3.59      2.97      0.085      1.39       0.087                0.10
NEAR(1)               0.61      0.37      0.93       0.005      1.67                 1.69
Robertson's Fixed     4.92      4.27      0.32       2.36       0.06                 0.13
Robertson's Random    2.36      1.91      0.07       0.76       0.26                 0.32

Table 9: Distance Measure (6.1): $\rho(s) = (0.75)^s$

                      PAR(1)         EAR(1)         TEAR(1)    NEAR(1)    Robertson's Fixed    Robertson's Random
PAR(1)                1.91           1.12           2.28       0.03       3.07                 3.02
EAR(1)                1.82           0.02           6.53       0.85       7.79                 7.69
TEAR(1)               4.14           3.85           0.46       1.25       0.79                 0.93
NEAR(1)               1.83           0.63           3.26       0.09       4.19                 4.15
Robertson's Fixed     (illegible)    (illegible)    0.31       1.63       0.57                 0.70
Robertson's Random    (illegible)    (illegible)    0.51       5.32       0.24                 0.36

model. The PAR(1) model is being identified as a NEAR(1) model and Robertson's Random
model is being identified as a TEAR(1) model. The theoretical plots of the 3rd-order cumulant
structure support this confusion as they show that these models produce very similar
surfaces that are hard to distinguish. In table 8 we note that the procedure fails again
to select PAR(1) and Robertson's random models. Errors occur at the first and last two
rows of table 9 where the procedure fails to distinguish PAR(1), the fixed and random models.
The incorrect selection that appears in the above tables is consistent with our previous
remark regarding the grouping of the models into three categories. Robertson's models and
TEAR(1) were identified as sharing a very similar 3rd-order cumulant structure and so were
PAR(1) and NEAR(1). Thus, one would expect to have difficulties in discriminating among
models that belong to the same family. The pattern established in the previous tables is
consistent in the 30 repetitions we consider in tables 10-12. PAR(1) is consistently confused
with NEAR(1), and TEAR(1) and Robertson's models stand out as a separate class. The
random model is by far the hardest to identify and typically is mistaken for the TEAR(1)
model. Although the procedure is successful in identifying TEAR(1) and the fixed model it
Table 10: Proportions of Correct Identification: $\rho(s) = (0.25)^s$

                      PAR(1)    EAR(1)    TEAR(1)    NEAR(1)    Robertson's Fixed    Robertson's Random
PAR(1)                0.0       0.0       0.0        1.0        0.0                  0.0
EAR(1)                0.03      0.97      0.0        0.0        0.0                  0.0
TEAR(1)               0.0       0.0       1.0        0.0        0.0                  0.0
NEAR(1)               0.0       0.0       0.0        1.0        0.0                  0.0
Robertson's Fixed     0.0       0.0       0.0        0.0        0.73                 0.27
Robertson's Random    0.0       0.0       0.7        0.0        0.03                 0.27

Table 11: Proportions of Correct Identification: $\rho(s) = (0.5)^s$

                      PAR(1)    EAR(1)    TEAR(1)    NEAR(1)    Robertson's Fixed    Robertson's Random
PAR(1)                0.0       0.0       0.0        1.0        0.0                  0.0
EAR(1)                0.0       1.0       0.0        0.0        0.0                  0.0
TEAR(1)               0.0       0.0       0.7        0.03       0.27                 0.0
NEAR(1)               0.0       0.0       0.0        1.0        0.0                  0.0
Robertson's Fixed     0.0       0.0       0.17       0.0        0.83                 0.0
Robertson's Random    0.0       0.0       0.63       0.0        0.37                 0.0

Table 12: Proportions of Correct Identification: $\rho(s) = (0.75)^s$

                      PAR(1)         EAR(1)    TEAR(1)        NEAR(1)    Robertson's Fixed    Robertson's Random
PAR(1)                0.0            0.0       0.0            1.0        0.0                  0.0
EAR(1)                0.0            1.0       0.0            0.0        0.0                  0.0
TEAR(1)               0.0            0.0       0.67           0.07       0.26                 0.0
NEAR(1)               0.0            0.0       0.0            1.0        0.0                  0.0
Robertson's Fixed     0.0            0.0       (illegible)    0.0        0.53                 0.0
Robertson's Random    (illegible)    0.0       0.53           0.0        0.47                 0.0
Table 13: Proportions of Correct Identification: $\rho(s) = (0.25)^s$

                      TEAR(1)    Robertson's Fixed    Robertson's Random
TEAR(1)               0.57       0.03                 0.4
Robertson's Fixed     0.13       0.87                 0.0
Robertson's Random    0.17       0.1                  0.73

Table 14: Proportions of Correct Identification: $\rho(s) = (0.5)^s$

                      TEAR(1)    Robertson's Fixed    Robertson's Random
TEAR(1)               0.7        0.17                 0.13
Robertson's Fixed     0.27       0.46                 0.27
Robertson's Random    0.3        0.5                  0.2

is the confusion in selecting the random model that makes it difficult to judge the adequacy
of TEAR(1) or the fixed model. However, since the three models share very similar traces
and 3rd-order cumulant structure one may choose to accept each of the three as compatible
with any of that group.

To remedy this problem we may apply the proposed discrimination procedure to the 4th-order
cumulant structure for those three models. One may argue that since the models share
an identical 2nd-order moment structure and a similar 3rd-order cumulant structure (too
similar for their differences to be captured by (6.1)), it might be possible to reveal
their true identity through the use of the 4th-order cumulant structure. Tables 13-15 contain
the results of the simulation study applied to the 4th-order cumulant structure of TEAR(1)
and Robertson's models. The choice among the models is not clear cut as the proportions


Table 15: Proportions of Correct Identification: $\rho(s) = (0.75)^s$

                      TEAR(1)    Robertson's Fixed    Robertson's Random
TEAR(1)               0.63       0.3                  0.07
Robertson's Fixed     0.13       0.57                 0.0
Robertson's Random    0.17       0.47                 0.06



Table 16: Proportions of Correct Identification: $\rho(s) = (0.25)^s$

                 ARE(1)    PAR(1)         EAR(1)    TEAR(1)    NEAR(1)    Rob's Fixed    Rob's Random
ARE(1)           1.0       0.0            0.0       0.0        0.0        0.0            0.0
PAR(1)           0.0       (illegible)    0.0       0.0        0.27       0.0            0.0
EAR(1)           0.0       (illegible)    0.93      0.0        0.07       0.0            0.0
TEAR(1)          0.0       0.0            0.0       0.63       0.0        0.03           0.33
NEAR(1)          0.0       0.3            0.03      0.0        0.67       0.0            0.0
Rob's Fixed      0.0       0.0            0.0       0.07       0.0        0.67           0.26
Rob's Random     0.0       0.0            0.0       0.3        0.0        0.23           0.47

of correct identification are not large enough to enable a reasonable degree of discrimination
power among the three competing models. This result was expected to hold given the
theoretical expressions as expressed through the plots for the theoretical 4th-order cumulant
structure, figure 10. In these plots the models are shown to produce similar behavior at
various frames of $C(r, s, u)$; thus, there is no reason to expect a high degree of discrimination
power among the models on the basis of the proposed procedure and the 4th-order cumulant
structure.


7.4  Discrimination  Procedure  :  Method  2 

In tables 16-18 we provide the results of our simulation study according to (6.1) based on
the empirical cumulant structure only. Note that we added ARE(1) for comparison purposes.
Since the marginal moments of ARE(1) are different from the remaining models we
standardize its mean to equal one so the mean of the exponential innovation process becomes
$1 - \phi$. The higher order moments are not standardized to equal those of the exponential
models. The results in tables 16-18 are by and large consistent with the results obtained
under the previous method. The main difference appears to be in the improved separation
between PAR(1) and NEAR(1) under the second method, while under the first method, which
involved the theoretical cumulant structure, PAR(1) is consistently mistaken for NEAR(1).
We use method 2 with the 4th-order empirical cumulant structure for TEAR(1) and Robertson's
models. The results are summarized in tables 19-21. Figure 11 contains the plots of
Table 17: Proportions of Correct Identification: $\rho(s) = (0.5)^s$

                 ARE(1)         PAR(1)    EAR(1)    TEAR(1)    NEAR(1)    Rob's Fixed    Rob's Random
ARE(1)           (illegible)    0.0       0.0       0.0        0.0        0.0            0.0
PAR(1)           0.0            0.87      0.0       0.0        0.13       0.0            0.0
EAR(1)           0.0            0.0       1.0       0.0        0.0        0.0            0.0
TEAR(1)          0.0            0.0       0.0       0.43       0.0        0.27           0.3
NEAR(1)          0.0            0.17      0.0       0.0        0.53       0.0            0.0
Rob's Fixed      0.0            0.0       0.0       0.07       0.0        0.63           0.3
Rob's Random     0.0            0.0       0.0       0.3        0.0        0.4            0.3

Table 18: Proportions of Correct Identification: $\rho(s) = (0.75)^s$

                 ARE(1)    PAR(1)    EAR(1)    TEAR(1)    NEAR(1)    Rob's Fixed    Rob's Random
ARE(1)           1.0       0.0       0.0       0.0        0.0        0.0            0.0
PAR(1)           0.0       0.57      0.0       0.0        0.43       0.0            0.0
EAR(1)           0.0       0.0       1.0       0.0        0.0        0.0            0.0
TEAR(1)          0.0       0.0       0.0       0.43       0.03       0.2            0.34
NEAR(1)          0.0       0.53      0.0       0.0        0.47       0.0            0.0
Rob's Fixed      0.0       0.0       0.0       0.23       0.0        0.43           0.34
Rob's Random     0.0       0.0       0.0       0.3        0.0        0.3            0.4

the simulated 4th-order cumulant structure for the three models. The results confirm our
previous comment regarding the difficulties encountered by the discrimination procedure in
distinguishing among these three models.


8  Conclusions 


The problem of discrimination among non-linear time series models is considered in this
paper through the family of exponential models. In this specific case we are able to develop
the theoretical 3rd-order cumulant structure and confirm it by simulation. The procedure we


Table 19: Proportions of Correct Identification: $\rho(s) = (0.25)^s$

                      TEAR(1)    Robertson's Fixed    Robertson's Random
TEAR(1)               0.10       0.27                 0.33
Robertson's Fixed     0.13       0.54                 0.33
Robertson's Random    0.10       0.33                 0.27
Table 20: Proportions of Correct Identification: $\rho(s) = (0.5)^s$

                      TEAR(1)    Robertson's Fixed    Robertson's Random
TEAR(1)               0.27       0.20                 0.53
Robertson's Fixed     0.23       0.40                 0.37
Robertson's Random    0.27       0.3                  0.43

Table 21: Proportions of Correct Identification: $\rho(s) = (0.75)^s$

                      TEAR(1)    Robertson's Fixed    Robertson's Random
TEAR(1)               0.30       0.30                 (missing)
Robertson's Fixed     0.27       0.33                 0.40
Robertson's Random    0.27       0.30                 0.43

propose is not restricted to the class of AR(1) type models or the class of models for which
analytical results for the 3rd-order cumulant structure are available. It is a general procedure
with the potential for a wide range of non-linear models. It is based on the understanding
that different models cannot have an identical moment sequence; hence, the discrimination
among them would become possible at some stage in the higher order cumulant structure.
In our specific case we are able to obtain a significant improvement in our discriminatory
power just by going one step above the traditional second order moment analysis, i.e. the
correlation function. While second order moments play a dominating role in linear model
discrimination they are very limited in the non-linear case. When the 2nd-order analysis fails
to provide enough information we propose to apply higher order moment analysis for the
purpose of model discrimination.
Figure 3

Figure 5 (panels include Robertson's Fixed, beta = 0.5, and Robertson's Random)

Figure 7 (panels include Robertson's Fixed, beta = 0.5, and Robertson's Random)

Figure 10

Figure 11 (cont.)
Summary

In  this  article  we  discuss  ways  in  which  moments  are  used  (a)  to  approximate 
distributions,  and  (b)  to  test  fit  to  a  given  distribution. 


1  Approximating  distributions  using  moments 

Solomon  and  Stephens  (1977)  give  a  number  of  examples  of  statistics  X  for 
which  the  first  few,  or  even  all,  the  moments  or  cumulants  may  be  found,  but 
whose  density  f(x)  and  distribution  F(x),  assumed  continuous,  are  intractable. 
A  good  example  is  the  statistic  S  whose  distribution  is  the  weighted  sum  of 
independent  chi-square  variables,  each  with  one  degree  of  freedom,  written 

$$S = \sum_{i=1}^{k} \lambda_i u_i^2 \qquad (1)$$

where the $u_i$ are i.i.d. N(0, 1), and the $\lambda_i$ are known weights. Many quantities in statistics
have distributions (often asymptotic distributions) like S; for example, the
Pearson $X^2$ statistic, used in testing fit to a distribution when the distribution
tested contains unknown parameters which are estimated by maximising the
usual likelihood, rather than the multinomial likelihood, has this distribution
with some $\lambda_i \neq 1$. Other goodness-of-fit statistics, of Cramer-von Mises type,
based  on  the  empirical  distribution  function  (EDF),  also  have  such  asymptotic 
distributions  (see,  for  example,  many  examples  in  Stephens,  1986a). 

One of the first examples of S to be tabulated, for k = 2, involved errors in
target hitting during World War 2: tables for S were produced with some labour
by Grad and Solomon (1955) using analytic methods. These have been extended
by various authors to higher values of k, but the analysis after k = 5 or 6 rapidly
becomes very difficult. Thus in general it is difficult to find exact percentage
points of S, but the cumulants $\kappa_r$, r = 1, 2, ..., are very easily obtained:

$$\kappa_r = 2^{r-1} (r-1)! \sum_{i=1}^{k} \lambda_i^r \qquad (2)$$
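
As a minimal sketch of expression (2), the Python function below computes the first few cumulants of S from a list of weights. The function and argument names are illustrative only and do not come from the paper.

```python
from math import factorial


def weighted_chisq_cumulants(weights, r_max=4):
    """Cumulants kappa_1..kappa_{r_max} of S = sum_i lambda_i * u_i**2, via expression (2)."""
    cumulants = []
    for r in range(1, r_max + 1):
        # kappa_r = 2^(r-1) * (r-1)! * sum_i lambda_i^r
        power_sum = sum(lam ** r for lam in weights)
        cumulants.append(2 ** (r - 1) * factorial(r - 1) * power_sum)
    return cumulants
```

With a single unit weight the function returns the cumulants 1, 2, 8, 48 of a chi-square variable with one degree of freedom.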

2  Moments  and  cumulants 

In this section we list definitions. The r-th moment about the origin of a random
variable X, or equivalently of its distribution f(x), will be called $\mu'_r$; the r-th
moment about the mean will be $\mu_r$. The moment generating function $M_X(t)$ of
X is defined by

$$M_X(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx; \qquad (3)$$

when expanded as a Taylor series,

$$M_X(t) = 1 + \mu t + \frac{\mu'_2 t^2}{2!} + \frac{\mu'_3 t^3}{3!} + \cdots + \frac{\mu'_r t^r}{r!} + \cdots \qquad (4)$$

where $\mu = \mu'_1$ is the mean of X.

Cumulants $\kappa_r$ are defined through the cumulant generating function $C_X(t) =
\log M_X(t)$, where “log” refers to natural logarithm. Then

$$C_X(t) = \kappa_1 t + \frac{\kappa_2 t^2}{2!} + \frac{\kappa_3 t^3}{3!} + \cdots + \frac{\kappa_r t^r}{r!} + \cdots \qquad (5)$$

Thus in principle we must find $M_X(t)$ before finding $C_X(t)$.

The following relationships exist between low-order moments and cumulants:
$\kappa_1 = \mu = \mu'_1$; $\kappa_2 = \mu_2 = \sigma^2$; $\kappa_3 = \mu_3$; $\kappa_4 = \mu_4 - 3\mu_2^2$. Further relationships may
be found in Kendall and Stuart (1977, vol 1).

Suppose $Z = X_1 + X_2 + X_3 + \cdots + X_k$ where the $X_i$ are independent random
variables. Then a property of moment generating functions is

$$M_Z(t) = M_{X_1}(t)\, M_{X_2}(t)\, M_{X_3}(t) \cdots M_{X_k}(t),$$

so that

$$C_Z(t) = C_{X_1}(t) + C_{X_2}(t) + \cdots + C_{X_k}(t), \qquad (6)$$

and it quickly follows, using obvious notation, that

$$\kappa_r(Z) = \kappa_r(X_1) + \kappa_r(X_2) + \cdots + \kappa_r(X_k). \qquad (7)$$

This additive property makes it very easy to find cumulants of sums of independent
random variables, and hence, for example, the cumulants of S.
Two important $M_X(t)$ are those of the $N(\mu, \sigma^2)$ distribution, $M_X(t) =
\exp(\mu t + \sigma^2 t^2/2)$, and the $\chi^2_p$ distribution, $M_X(t) = 1/(1 - 2t)^{p/2}$. Finally,
it is easily shown that $\mu_r(aX + b) = a^r \mu_r(X)$, for $r \geq 2$, where a and b are any
real constants, and $\kappa_r(aX + b) = a^r \kappa_r(X)$, $r \geq 2$.

As an example, consider S. If X has a $\chi^2_1$ distribution, the MGF of X
is $1/(1 - 2t)^{1/2}$; thus $C_X(t) = -\frac{1}{2}\log(1 - 2t)$, and expansion gives $C_X(t) =
t + 2t^2/2! + 8t^3/3! + \cdots$. Thus the r-th cumulant of X is $\kappa_r = 2^{r-1}(r-1)!$,
that of $\lambda_i X$ is $\lambda_i^r \kappa_r$, and by the additive property (7), the r-th cumulant of S
is given by the expression (2).

3  Mathematical  approximations 

The approximations in this section are called “mathematical” because they are
based on mathematical analysis, with known properties of accuracy and convergence,
in contrast to those to be considered later.

Suppose n(t) is the standard normal density

$$n(t) = e^{-t^2/2}/\sqrt{2\pi} \qquad (8)$$

and let f(x) be the (continuous) density of X. Then it is (nearly always) possible
to expand f(x) as

$$f(x) = n(x)\left\{1 + \tfrac{1}{2}(\mu_2 - 1)H_2(x) + \tfrac{1}{6}\mu_3 H_3(x) + \tfrac{1}{24}(\mu_4 - 6\mu_2 + 3)H_4(x) + \cdots\right\} \qquad (9)$$

called a Gram-Charlier series. The $H_r(x)$ are Hermite polynomials. Lists of
Hermite polynomials, and also conditions for convergence, etc., are given in
Kendall and Stuart (1977, vol. 1).

The basic technique involved in deriving (9) rests on the fact that Hermite
polynomials are orthogonal with respect to the kernel n(x); thus

$$\int_{-\infty}^{\infty} H_i(x) H_j(x) n(x)\,dx = \begin{cases} 0, & i \neq j, \\ j!, & i = j. \end{cases} \qquad (10)$$

Then if $f(x) = \sum_j c_j\, n(x) H_j(x)$, multiplication by $H_j(x)$ on both sides, and
integration, gives

$$c_j = \int_{-\infty}^{\infty} f(x) H_j(x)\,dx / j!.$$

When worked out, $c_2 = (\mu_2 - 1)/2$, $c_3 = \mu_3/6$, etc.

If an infinite set of moments is available, as for S, the density can be approximated
very accurately using a Gram-Charlier series of sufficient length, but
there are many statistics in practical applications for which it is difficult even
to get the first four moments; see Solomon and Stephens (1977) for examples.
There are two other important drawbacks:

1. A k-term fit might, at any one value of x, be worse than a (k − 1)-term
fit.

2. Gram-Charlier series with finite numbers of moments can give a negative
density f(x), particularly in the tails.

3.1  Percentage  points  approximation 

A Gram-Charlier-type expansion can also be found for F(x), the distribution
function of X; this can be inverted to give a percentage point for a given cumulative
area $\alpha$. Thus suppose $F(x_\alpha) = \alpha$; we want an approximation to $x_\alpha$. A
Cornish-Fisher expansion gives $x_\alpha$ as a series in Hermite polynomials in
x, or (more practically useful) in $\xi$, where $\xi$ is the percentile corresponding to
$\alpha$ for the normal distribution, that is, $\xi$ is the solution of

$$\int_{-\infty}^{\xi} n(x)\,dx = \alpha. \qquad (11)$$

Again, problems can arise with the convergence to the desired $x_\alpha$. For more
details on mathematical expansions of Gram-Charlier or Cornish-Fisher type,
see Kendall and Stuart (1977, vol. 1).
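
The paper does not write the expansion out. As a hedged illustration, the sketch below uses the usual textbook four-cumulant form, $x_\alpha \approx \mu + \sigma\{\xi + \gamma_1(\xi^2 - 1)/6 + \gamma_2(\xi^3 - 3\xi)/24 - \gamma_1^2(2\xi^3 - 5\xi)/36\}$, with $\gamma_1 = \kappa_3/\kappa_2^{3/2}$ and $\gamma_2 = \kappa_4/\kappa_2^2$. The function name and argument order are assumptions made for this example.

```python
from statistics import NormalDist


def cornish_fisher_point(alpha, mean, variance, kappa3, kappa4):
    """Approximate x_alpha with F(x_alpha) = alpha from the first four cumulants."""
    xi = NormalDist().inv_cdf(alpha)          # normal percentile, equation (11)
    gamma1 = kappa3 / variance ** 1.5         # standardised third cumulant
    gamma2 = kappa4 / variance ** 2           # standardised fourth cumulant
    w = (xi
         + gamma1 / 6.0 * (xi ** 2 - 1.0)
         + gamma2 / 24.0 * (xi ** 3 - 3.0 * xi)
         - gamma1 ** 2 / 36.0 * (2.0 * xi ** 3 - 5.0 * xi))
    return mean + variance ** 0.5 * w
```

When the third and fourth cumulants are zero the function returns the normal percentage point; a positive third cumulant moves the upper percentage points to the right.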


4  Pearson  curves  and  other  systems 


We  now  turn  to  a  method  of  approximation  which  can  be  thought  of  as  “laying 
one  curve  upon  another”  —  the  approximating  curve  has  parameters  which  can 
be  varied  to  make  a  good  fit.  The  parameters  are  usually  chosen  by  matching 
moments  or  cumulants.  Percentage  points  of  the  approximating  curve,  which 
are  tabulated  or  otherwise  easily  found,  are  then  used  as  approximations  to  the 
desired  points. 

A family of approximating curves is the Pearson system, where the (continuous)
density f(x) is approximated by f*(x), given by

$$\frac{1}{f^*(x)} \frac{df^*(x)}{dx} = \frac{a + x}{b_0 + b_1 x + b_2 x^2}. \qquad (12)$$

According to the values of the constants $a, b_0, b_1, b_2$, integration of the right-hand
side will take many forms, giving great flexibility to the system of densities
f*(x). With considerable algebra (see Elderton and Johnson, 1969, for details),
the constants may be put in terms of the moments. Suppose

$$A = 10\mu_4\mu_2 - 18\mu_2^3 - 12\mu_3^2; \qquad (13)$$

then

$$a = \frac{\mu_3(\mu_4 + 3\mu_2^2)}{A}, \qquad (14)$$

$$b_0 = \frac{-\mu_2(4\mu_2\mu_4 - 3\mu_3^2)}{A}, \qquad (15)$$

$$b_1 = -a, \qquad (16)$$

$$b_2 = \frac{-(2\mu_2\mu_4 - 3\mu_3^2 - 6\mu_2^3)}{A}. \qquad (17)$$

Thus knowledge of the first four moments or cumulants of X will fix the constants
above: a further constant C enters on integrating, but is fixed by the fact
that the total integral of f*(x) must be 1.
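
The constants can be computed directly from the moments. The sketch below assumes that x is measured from the mean, so that the inputs are the central moments $\mu_2, \mu_3, \mu_4$, and it takes the coefficient of $\mu_2^3$ in $b_2$ to be 6, the value for which the normal distribution gives $b_2 = 0$. The function name is illustrative only.

```python
def pearson_constants(mu2, mu3, mu4):
    """Constants a, b0, b1, b2 of equation (12) from central moments, per (13)-(17)."""
    big_a = 10.0 * mu4 * mu2 - 18.0 * mu2 ** 3 - 12.0 * mu3 ** 2
    if big_a == 0.0:
        raise ValueError("A = 0: the constants are undefined for these moments")
    a = mu3 * (mu4 + 3.0 * mu2 ** 2) / big_a
    b0 = -mu2 * (4.0 * mu2 * mu4 - 3.0 * mu3 ** 2) / big_a
    b1 = -a
    b2 = -(2.0 * mu2 * mu4 - 3.0 * mu3 ** 2 - 6.0 * mu2 ** 3) / big_a
    return a, b0, b1, b2
```

For normal moments with variance 4 the constants are (0, -4, 0, 0), so that (12) becomes $-x/\sigma^2$, the logarithmic derivative of the normal density. For the exponential distribution with unit mean (central moments 1, 2, 9) they are (1, -1, -1, 0).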

4.1  Percentage  points 

When the constants are known, the density f*(x) may be integrated and percentage
points solved for numerically. Over the years, this was done, at first
very laboriously, for a small range of possibilities, but a quite extensive tabulation
was made, using electronic computers, in the late ’60s. These tables
are in Biometrika Tables for Statisticians, vol. II. The form of the tables is
as follows. The percentage points for X, the standardised x-variable given by
$X = (x - \mu)/\sigma$, are tabulated in a two-way table indexed by the skewness and
kurtosis parameters $\beta_1$ and $\beta_2$. These are defined by

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} \quad\text{and}\quad \beta_2 = \frac{\mu_4}{\mu_2^2}; \qquad (18)$$

they have been defined to be scale-free, and $\sqrt{\beta_1}$ takes the sign of $\mu_3$. $\sqrt{\beta_1}$
measures skewness: a large (positive) $\sqrt{\beta_1}$ means the curve is skewed towards
positive values (long tail is to the right) and vice versa for negative $\sqrt{\beta_1}$. A
large $\beta_2$ (always positive) means the density has heavy tails. Of course, all
symmetric distributions have $\beta_1 = 0$; a benchmark to measure kurtosis is the
normal distribution for which $\beta_2 = 3$. Since $\kappa_4 = \mu_4 - 3\mu_2^2$, the parameter
$\gamma_2 = \beta_2 - 3 = \kappa_4/\kappa_2^2$ can also be regarded as measuring kurtosis, with value
$\gamma_2 = 0$ for the normal distribution.
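
When cumulants rather than moments are available, as for S, the table indices in (18) follow from $\mu_2 = \kappa_2$, $\mu_3 = \kappa_3$ and $\mu_4 = \kappa_4 + 3\kappa_2^2$. A minimal sketch, with an illustrative function name:

```python
from math import sqrt


def skewness_kurtosis_from_cumulants(kappa2, kappa3, kappa4):
    """Return (sqrt(beta_1), beta_2) of equation (18) from the cumulants."""
    mu2 = kappa2
    mu3 = kappa3
    mu4 = kappa4 + 3.0 * kappa2 ** 2      # since kappa_4 = mu_4 - 3 * mu_2^2
    root_beta1 = mu3 / mu2 ** 1.5         # carries the sign of mu_3
    beta2 = mu4 / mu2 ** 2
    return root_beta1, beta2
```

For a chi-square variable with one degree of freedom (cumulants 2, 8, 48) this gives $\sqrt{\beta_1} = \sqrt{8}$ and $\beta_2 = 15$.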

Suppose, for a given S, we have $\sqrt{\beta_1} = 0.8$ and $\beta_2 = 4.6$. To use Biometrika
Tables, one enters the appropriate $\sqrt{\beta_1}$ table, $\sqrt{\beta_1} = 0.8$, and travels down
the left-hand column until the $\beta_2$ value, 4.6, is reached. Along the row are 17
tabulated percentage points for X, from $\alpha = 0.00$ to $\alpha = 1.00$. Interpolation
must be used for $\sqrt{\beta_1}$, $\beta_2$ values not explicitly given.

4.2  Un  peu  d’histoire 

At this point, perhaps, it might be permitted to enliven the account with what
the Guide Michelin calls un peu d’histoire. At the time Biometrika Tables Vol.
II were being prepared, I was fortunate enough to know Professor E. S. Pearson,
then retired but still very active, especially as Editor of Biometrika. He
had collaborated with workers in the U.S. to get the tables (Johnson, Nixon,
Amos and Pearson, 1963) and had carefully compiled the full set by hand. He
had introduced me to Pearson curves, which, to put it mildly, did not figure
prominently in statistical training of the day, and had shown me how effective
they could be. He gave me a copy of the tables to use. I undertook to write
a Fortran program on the IBM 650, to interpolate and find points, given the
first four moments. All 20 tables were then typed onto punched cards; in the
end, I got it down to approximately 45 minutes per table. This is not such a
dramatic piece of history as Michelin usually provides (assignations and assassinations
often play a prominent role), but a diminishing generation of modern
readers will still empathise with the fears of losing the boxes of cards, getting
them wet in the snows of Montreal, etc., not to mention the awful discovery of
a wrongly-typed number!

Since then, programs have been written to integrate the density equation
for f*(x) numerically and to solve for $x_\alpha$ for given $\alpha$, or to provide the tail
area for given x; one of these, kindly given to me by Amos and Daniel (1971),
has been added to my program; this greatly increases the range of $\beta_1$ and $\beta_2$
for which Pearson curve approximations can be found. However, points are
still output from both the Amos and Daniel part of the program and the
Biometrika Tables part, ostensibly as a check where available, but truthfully as
a sentimental tribute to E. S. P.

Later  on,  Charles  Davis  and  I  (Davis  and  Stephens,  1983)  added  to  the 
program  to  enable  a  fit  to  be  made  using  knowledge  of  an  end  point  (for  example, 
that  the  left-hand  endpoint  of  S  is  zero)  and  three  moments.  This  is  especially 
valuable  for  the  type  of  statistic  for  which  each  successive  moment  requires 
exponentially  increasing  hard  work  —  for  example,  the  distribution  of  areas,  or 
perimeters,  of  polygons  formed  by  randomly  dropping  lines  on  a  plane  —  see 
Solomon  and  Stephens  (1977).  The  Pearson-curve  fitting  program  is  available 
from  the  author. 

Further  developments  have  included  algorithms  to  facilitate  use  of  Pearson 
curves  —  see,  for  example,  Bowman  and  Shenton  (1979a,  1979b). 

4.3  Accuracy  of  Pearson  curve  fits 

(a) Pearson curve densities are unimodal, or possibly J- or U-shaped, but never
multimodal. They are also never negative.

(b) Percentage points or tail areas found from Pearson curve fitting have been
found, for unimodal long-tailed distributions, to be very accurate in the
long tail, at least for tail areas bigger than 0.005, or the 0.5% point.
Pearson and Tukey (1965) discuss this issue; Solomon and Stephens (1977)
give comparisons. (In making comparisons, one must of course compare
the Pearson curve fit with the correct $x_\alpha$, or the correct area for given x,
for a distribution which is not itself a member of the Pearson family.)

(c) Davis (1975) has made extensive comparisons with Gram-Charlier fits using
only four moments. Pearson curve fits are better than Gram-Charlier fits
everywhere except for distributions very close to the normal, as measured
by the $\beta_1$, $\beta_2$ values.

4.4  Other  systems 

Johnson  (1949)  has  proposed  another  family  (divided  into  three  parts)  of  curves 
defined by four moments: for example, the $S_U$ curves are those given by the
relation

$$\xi = \gamma + \delta \sinh^{-1} X \qquad (19)$$

where $X = (x - \mu)/\sigma$, and $\gamma$, $\delta$ are to be chosen to make the distribution of
$\xi$ as close as possible to N(0, 1). A discussion, and tables to facilitate the
calculation of $\gamma$ and $\delta$, are in Biometrika Tables for Statisticians Vol. II. Other
authors  have  also  proposed  families  of  distributions,  but  they  have  not  come 
into  such  common  use  for  the  purpose  of  approximating  percentage  points. 

5  Use  of  higher  moments 

We now turn to the first of two interesting questions: can higher moments
be used to improve the accuracy of Pearson curve fits in the long tail of the
distribution? The long tail will be supposed to lie to the right, as for the
distribution of S; then, since higher values of x will contribute more to the
higher moments than smaller values, we might suppose that fits using higher
moments will improve accuracy. Unfortunately it is not easy to establish the
four constants in terms of higher moments; of course, only four of these would
be needed to fix the constants. A recursion formula exists to generate higher
moments, for r = 2, 3, ...:

$$r b_0 \mu_{r-1} + \{(r+1) b_1 + a\}\mu_r + \{(r+2) b_2 + 1\}\mu_{r+1} = 0. \qquad (20)$$

In this recursion, the constants $a$, $b_0$, $b_1$ and $b_2$ occur, and this means that one
cannot reverse the recursion and generate, say, $\mu$ and $\sigma^2$ from $\mu_5$ and $\mu_6$.
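
Run forwards, recursion (20) is straightforward. The sketch below solves it for $\mu_{r+1}$ and extends a list of central moments of a Pearson curve whose constants are already known; the function name and the list layout (index r holds $\mu_r$, with $\mu_0 = 1$ and $\mu_1 = 0$) are assumptions made for this example.

```python
def extend_pearson_moments(a, b0, b1, b2, central_moments, r_max=6):
    """Extend [mu_0, mu_1, ..., mu_4] up to mu_{r_max} using recursion (20)."""
    mu = list(central_moments)
    while len(mu) <= r_max:
        r = len(mu) - 1                      # highest order known so far
        lead = (r + 2) * b2 + 1.0            # coefficient of mu_{r+1} in (20)
        if lead == 0.0:
            raise ValueError("recursion (20) cannot be solved for order %d" % (r + 1))
        nxt = -(r * b0 * mu[r - 1] + ((r + 1) * b1 + a) * mu[r]) / lead
        mu.append(nxt)
    return mu
```

For the exponential distribution with unit mean, which is itself a Pearson curve with constants (1, -1, -1, 0), the call `extend_pearson_moments(1.0, -1.0, -1.0, 0.0, [1.0, 0.0, 1.0, 2.0, 9.0])` appends the exact fifth and sixth central moments, 44 and 265.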

Nevertheless, one can generate the fifth and sixth moments of the Pearson
curve with the same first four moments as, say, S, and compare them with the
true fifth and sixth moments of S. The first two moments are then slightly
changed, and the procedure successively repeated, until the third, fourth, fifth
and sixth moments of each curve match. This will mean that the mean and
variance of the Pearson curve will not be exactly the same as those for S,
although they will be close, and this will probably make a worse fit in the lower
tail; but for higher x the fit could improve. I have made some comparisons using
this procedure, but, as one might expect, there appears to be no systematic
improvement. In discussion, when this paper was first presented, the suggestion
was made to use Least Squares to make “closest” fits, in order to compare the
six moments. More work is needed to compare Pearson curve fits along these
various lines, but it is not likely that the improvement will be sure, or will
extend to points far into the tails. In the end it must be remembered that one
curve is simply being laid on top of another, with only four parameters to vary,
and there is no mathematical analysis that will guarantee accuracy.

Other methods for developing accuracy in the extreme tails include numerical
inversion of the Characteristic Function (essentially the MGF with $it$ replacing $t$,
where $i = \sqrt{-1}$) or saddlepoint approximations. A method due to Imhof (1961)
uses numerical inversion for distributions such as S, but the computer time
needed increases rapidly as the distance into the tails increases (to give small
tail areas). Field (1992) has recently examined saddlepoint approximations for
S. These would seem to give more promise of tail-end accuracy in the long run.

6  Use  of  sample  moments 

The  second  interesting  question  is:  how  accurate  are  Pearson  curve  fits  when 
sample  moments  are  used  to  make  the  fit?  In  the  earliest  days,  this  was  the  use 
to  which  Pearson  curves  were  applied  —  to  find  a  smooth  density  to  describe 
a set of data, such as lengths of beans, or width of skulls. Kendall and Stuart
(1977, Vol. 1) give details of such a fit. In general, the Pearson curves will give
very  good  fits  to  a  unimodal  set  of  data,  or  even  to  J-shaped  or  U-shaped  sets, 
but  it  is  important  to  assess  the  accuracy  of  extrapolation  from  the  sample  to 
the  supposed  population  from  which  it  came.  More  precisely,  we  ask  how  close 
the  sample  fit  estimate  of,  say,  the  upper-tail  5%  point  is  to  the  true  population 
5%  point,  and,  further,  whether  or  not  the  Pearson-curve  point  is  better  than 
the  estimated  point  derived  from  choosing  the  appropriate  order  statistic  —  in  a 
sample  of  1000,  the  951st  value  in  ascending  order,  or  in  a  sample  of  size  10000, 
the  9501st  value.  Some  investigation  of  these  questions  has  been  undertaken  in 
two  quite  different  ways,  by  Johnstone  (1988)  and  by  myself  (Stephens,  1991). 
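
The order-statistic estimate referred to above can be sketched as follows. The rank rule used here, floor(n(1 - alpha)) + 1, is an assumption chosen to reproduce the two ranks quoted in the text (951 for n = 1000 and 9501 for n = 10000 at the upper 5% point); other conventions exist. The function name is illustrative only.

```python
from math import floor


def upper_tail_order_statistic(sample, alpha):
    """Order-statistic estimate of the upper-tail point with tail area alpha."""
    ordered = sorted(sample)
    n = len(ordered)
    # 1-based rank; the small offset guards against floating-point round-off
    rank = min(n, floor(n * (1.0 - alpha) + 1e-9) + 1)
    return rank, ordered[rank - 1]
```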

The  accuracy  of  the  Pearson  curve  point  will  depend  on: 

1.  the  sample  size  n, 

2. the $\alpha$-level (tail area) of the point required,

3.  the  true  skewness  and  kurtosis  of  the  density  approximated, 

4.  higher  moments. 

Johnstone gives a small study, for samples from populations with the following
range of parameters:

    $\beta_1$    0.0    0.0    1.0     1.0    2.0
    $\beta_2$    3.3    4.0    5.25    6.0    7.5

Johnstone gives plots of the estimated coefficient of variation, CV, of the
Pearson curve $x_\alpha$ against $-\log \alpha$, where the base of logarithms is 10. Thus the
CV of the estimated $x_{0.01}$ is plotted against 2, that of the estimated $x_{0.001}$ is
plotted against 3, etc. The coefficient of variation is estimated using a Taylor
series approximation. As one might expect, the CV goes up markedly as $\alpha$ gets
smaller (so $-\log \alpha$ gets larger on the x-axis), and the steepness of the rise is
greater for the more skew distributions.

In Stephens (1991), Monte Carlo samples were taken from populations for
which exact percentage points could be found, and the exact points were compared
with those obtained from (a) Pearson curve fits using the moments of
each sample, and (b) the order statistic estimate from each sample. The order
statistic estimate will be asymptotically unbiased, while one can say nothing
exact about the point obtained by laying one curve on another; recall that sample
moments, especially the third and fourth, are extremely variable, even for
quite large samples. The results showed, as expected, that the Pearson curve
points were more biased. However, somewhat surprisingly, they had smaller
mean square error. Therefore, it might well be preferable to use the Pearson
curve points, although, again, more investigations should be made especially if
the points required are far into the tail.


7  Goodness  of  fit  using  moments 

In this second part of the paper, we discuss how moments are used in Goodness-of-Fit,
that is, to test whether a random sample comes from a given (continuous)
distribution. The distribution will often have unknown parameters, which must
be estimated from the given sample.


7.1  Tests  based  on  skewness  and  kurtosis 

Suppose the r-th sample moment $m_r$ about the mean is defined by

$$m_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r. \qquad (21)$$

The sample skewness and sample kurtosis are then defined by

$$b_1 = \frac{m_3^2}{m_2^3}, \qquad b_2 = \frac{m_4}{m_2^2}. \qquad (22)$$


These statistics are not unbiased estimates of $\beta_1$ and $\beta_2$, but they are consistent,
that is, the bias diminishes with increasing sample size. The sample skewness
and kurtosis are time-honoured statistics for testing normality, having been used
in a rather ad hoc manner for most of this century; $\sqrt{b_1}$ is compared with zero,
and $b_2$ with 3, the value of $\beta_2$ for the normal distribution. However, distribution
theory of $\sqrt{b_1}$ and $b_2$ is difficult, and it is only since computers have been
available that extensive and reliable tables of significance points have existed for
these statistics. Further, $b_1$ and $b_2$ can be combined to give one overall statistic
(d’Agostino and Pearson, 1973, 1974; d’Agostino, 1986). For other distributions
Bowman and Shenton (1986) have also given tables for these statistics. Studies
have shown that skewness and kurtosis, especially combined, provide good
omnibus tests for normality, although less is known for other distributions. For
the important discrete distribution, the Poisson, all cumulants are equal to the
mean, denoted by the parameter $\lambda$; a time-honoured test for the Poisson is
based on the ratio of sample variance to sample mean, which of course should
be about one. Again, this simple statistic appears to compete well with others
in terms of power.
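
A minimal sketch of the sample statistics in (21) and (22), together with the variance-to-mean ratio used for the Poisson test. The function names are illustrative only, and the use of the divisor n - 1 in the sample variance of the Poisson ratio is an assumption; the text does not specify the divisor.

```python
def sample_central_moment(xs, r):
    """m_r of equation (21): the r-th sample moment about the sample mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** r for x in xs) / n


def sample_b1_b2(xs):
    """Sample skewness b_1 and kurtosis b_2 of equation (22)."""
    m2 = sample_central_moment(xs, 2)
    m3 = sample_central_moment(xs, 3)
    m4 = sample_central_moment(xs, 4)
    return m3 ** 2 / m2 ** 3, m4 / m2 ** 2


def poisson_dispersion_ratio(counts):
    """Ratio of sample variance (divisor n - 1, an assumption) to sample mean."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean
```

For the symmetric sample 1, 2, 3, 4, 5 this gives $b_1 = 0$ and $b_2 = 1.7$, well below the normal value of 3.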

7.2  A  formal  technique  based  on  moments 

Perhaps because of the variability of sample moments, which makes calculation
of significance points difficult for statistics based on these moments when calculated
from samples of reasonable size, it took some time to formalize a technique
based on moments. Gurland and Dahiya (1970) and Dahiya and Gurland (1972)
have however devised a general procedure. The essential steps are as follows:

1. A vector $\zeta$ of length s, say, must be found, whose components are functions
of the theoretical moments, and such that each component $\zeta_i$ is linear
in the parameters. (This might involve re-parametrising the distribution
from its usual form.)

2. The estimate h of $\zeta$ is obtained by replacing theoretical moments by sample
moments.

3. The test statistic is then based on the difference $h - \zeta$.

Suppose that $\Sigma$ is the covariance matrix of h, $\theta$ is the q-vector of unknown
parameters, and W is the s × q matrix such that $\zeta = W\theta$. Then define

$$Q_t = n (h - W\hat{\theta})' \hat{\Sigma}^{-1} (h - W\hat{\theta}),$$

where $\hat{\theta} = (W'\hat{\Sigma}^{-1}W)^{-1} W'\hat{\Sigma}^{-1}h$. The statistic $\hat{\theta}$ is the regression estimate of
$\theta$ obtained by generalized least squares, and $\hat{\Sigma}$ is $\Sigma$ with the estimate $\hat{\theta}$ used
wherever $\theta$ appears.
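
The quadratic form above is an ordinary generalized-least-squares computation. The sketch below assumes NumPy is available and takes the estimated covariance matrix as a given input rather than constructing it from the parameter estimate; the function name and argument order are illustrative only.

```python
import numpy as np


def gls_moment_statistic(h, W, sigma_hat, n):
    """Return (Q, theta_hat) for Q = n (h - W theta)' Sigma^{-1} (h - W theta)."""
    h = np.asarray(h, dtype=float)
    W = np.asarray(W, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    sinv_h = np.linalg.solve(sigma_hat, h)          # Sigma^{-1} h
    sinv_w = np.linalg.solve(sigma_hat, W)          # Sigma^{-1} W
    # Generalized least squares estimate of theta
    theta_hat = np.linalg.solve(W.T @ sinv_w, W.T @ sinv_h)
    resid = h - W @ theta_hat
    q = n * float(resid @ np.linalg.solve(sigma_hat, resid))
    return q, theta_hat
```

If h lies exactly in the column space of W, the residual vanishes and the statistic is zero; any component of h that W cannot reproduce contributes to the statistic.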

Gurland and Dahiya (1970, 1972) showed that, asymptotically, the test
statistic $Q_t$ has the $\chi^2_t$ distribution with $t = s - q$ degrees of freedom. Currie and
Stephens (1986, 1990) have studied the procedure, and show several properties
of $Q_t$. Among these is the fact that the test statistic $Q_t$ can be broken into
t components, each with asymptotic $\chi^2_1$ distribution and each testing different
features of the distribution. Each component is a function of moments or cumulants.
For example, consider the test for normality, that is, for the distribution
$N(\mu, \sigma^2)$. Gurland and Dahiya (1970) took $\zeta' = \{\mu, \log\mu_2, \mu_3, \log(\mu_4/3)\}$, so
that $h' = \{\bar{x}, \log m_2, m_3, \log(m_4/3)\}$. The matrix W and the parameter vector $\theta$ are

$$W = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 2 \end{pmatrix} \quad\text{and}\quad \theta = \begin{pmatrix} \mu \\ \log\sigma^2 \end{pmatrix}.$$

The test statistic $Q_t$ becomes $c_1 + c_2$, where the two components
are $c_1 = n m_3^2/(6 m_2^3)$ and $c_2 = (3n/8)\{\log(m_4/(3m_2^2))\}^2$. Thus the method leads to
$n b_1/6$ and $(3n/8)\{\log(b_2/3)\}^2$ as test statistics, equivalent to the “old-fashioned”
$b_1$ and $b_2$.

However, it should be noted that the components are not unique; they depend
on how $\zeta$ is formed. Currie and Stephens (1986, 1990) discuss these
questions in some detail.
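
The two normality components can be computed directly from a sample. In the sketch below the logarithm in the second component is squared, so that each component can have an asymptotic chi-square distribution with one degree of freedom; the p-value uses the fact that the upper tail of a chi-square variable with two degrees of freedom is exp(-q/2). The function name is illustrative only.

```python
from math import exp, log


def normality_components(xs):
    """Components c1, c2, their sum, and an asymptotic p-value for normality."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    c1 = n * m3 ** 2 / (6.0 * m2 ** 3)                       # n * b1 / 6
    c2 = (3.0 * n / 8.0) * log(m4 / (3.0 * m2 ** 2)) ** 2    # (3n/8) * {log(b2/3)}^2
    q = c1 + c2
    p_value = exp(-q / 2.0)     # upper tail of chi-square with 2 degrees of freedom
    return c1, c2, q, p_value
```

Since the distributions are asymptotic, the p-value is only a rough guide for small samples.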


8  Components of other goodness-of-fit statistics

Other  goodness-of-fit  statistics  also  have  components  which  are  functions  of 
moments.  The  oldest  of  these  was  proposed  by  Neyman  (1937),  in  connection 
with  a  test  for  uniformity. 

A test for a fully specified continuous distribution (that is, all parameters
known) can always be converted to a test for uniformity by means of the Probability
Integral Transformation, and a test for the exponential distribution can
also be so converted, even when the scale and origin parameters are not known,
so that Neyman’s test has wider applicability than it might at first appear. (For
details of these transformations, see Stephens, 1986a, 1986b.)

Neyman’s test is as follows: suppose the test is that Z has a uniform distribution
between 0 and 1, written U(0, 1). On the alternative, let the logarithm
of the density of Z be expanded as a series of Legendre polynomials:

$$\log(f(z)) = A(c)\{1 + c_1 L_1(z) + c_2 L_2(z) + c_3 L_3(z) + \cdots\}, \qquad (23)$$

where the $c_j$ are coefficients, components of the vector c, $L_j(z)$ is the j-th
Legendre polynomial, and A(c) is a normalising constant.

A test for uniformity is then a test that all $c_j = 0$. The estimates of $c_j$ are

$$\hat{c}_j = \frac{1}{n}\sum_{i=1}^{n} L_j(z_i), \qquad (24)$$

where $z_1, z_2, \ldots, z_n$ is the given sample.
The first few Legendre polynomials are best expressed in terms of $y = z - 0.5$.
Then

$$L_1(z) = 2\sqrt{3}\,y, \qquad (25)$$

$$L_2(z) = \sqrt{5}(6y^2 - 0.5), \qquad (26)$$

$$L_3(z) = \sqrt{7}(20y^3 - 3y), \qquad (27)$$

so that the estimate $\hat{c}_1$ becomes a function of the first moment about the known
mean 0.5, the second estimate $\hat{c}_2$ becomes a function of the second moment, $\hat{c}_3$
a function of both the third and the first moments, etc.

Neyman shows that the suitably normalised $\hat{c}_j$ have asymptotic N(0, 1) distributions,
and his overall test statistic is the sum of the squares of these normalised
estimates. Thus the overall statistic has an asymptotic $\chi^2$ distribution,
just as for the Dahiya-Gurland statistic, and the individual terms, based on
moments, are the components of the overall test statistic.
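
A sketch of the first three Neyman components, using the polynomials (25) to (27). It assumes that each estimate is the sample average of $L_j(z_i)$ and that the normalisation is multiplication by the square root of n, so that each squared normalised estimate is n times the squared average; the function name is illustrative only.

```python
from math import sqrt


def neyman_smooth_statistic(z):
    """Return (estimates, squared normalised components, overall statistic)."""
    n = len(z)
    sums = [0.0, 0.0, 0.0]
    for zi in z:
        y = zi - 0.5
        sums[0] += 2.0 * sqrt(3.0) * y                      # L_1, equation (25)
        sums[1] += sqrt(5.0) * (6.0 * y ** 2 - 0.5)         # L_2, equation (26)
        sums[2] += sqrt(7.0) * (20.0 * y ** 3 - 3.0 * y)    # L_3, equation (27)
    c_hat = [s / n for s in sums]
    components = [n * c ** 2 for c in c_hat]                # (sqrt(n) * c_hat_j)^2
    return c_hat, components, sum(components)
```

Under these assumptions the overall statistic would be referred to a chi-square distribution with three degrees of freedom.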

9  EDF  statistics 

Another important family of goodness-of-fit statistics is that derived from the
Empirical Distribution Function (EDF) of the z-sample. This family includes
the well-known Kolmogorov-Smirnov statistic and the Cramer-von Mises family
of statistics (for details and tests for many distributions based on these, see
Stephens, 1986a).

One of the most important of the Cramer-von Mises class is $A^2$, introduced
by Anderson and Darling (1954). The definition of $A^2$ is based on an integral
involving the difference between the EDF and the tested distribution F(x) (with
parameters estimated if necessary). The working formula is

$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i - 1)\left[\log z_{(i)} + \log(1 - z_{(n+1-i)})\right] \qquad (28)$$

where $z_i = F(x_i)$, and $z_{(i)}$ are the order statistics.
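
The working formula (28) translates directly into code. The sketch below takes the already transformed values $z_i = F(x_i)$, which must lie strictly between 0 and 1; the function name is illustrative only.

```python
from math import log


def anderson_darling(z):
    """Anderson-Darling statistic A^2 from values z_i = F(x_i), via formula (28)."""
    zs = sorted(z)                       # order statistics z_(1) <= ... <= z_(n)
    n = len(zs)
    total = 0.0
    for i in range(1, n + 1):
        # zs[i - 1] is z_(i); zs[n - i] is z_(n+1-i)
        total += (2 * i - 1) * (log(zs[i - 1]) + log(1.0 - zs[n - i]))
    return -n - total / n
```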

As an omnibus test statistic, $A^2$ has been shown to perform well in many
test situations.

Anderson and Darling showed that the asymptotic distribution of $A^2$ is,
like S of Section 1, a sum of weighted $\chi^2$ variables. The individual terms
in the sum can again be regarded as components of the entire statistic, and
Stephens (1974) has investigated these components in some detail. A remarkable
result is that they too are based on Legendre polynomials, so that they are
effectively the same as the Neyman components, based on moments of the z-sample.
There has been some investigation of components of these and other
statistics, as individual test statistics for the distribution under test; references
are given by Stephens (1986a). As for the Gurland-Dahiya components, they can
be expected to be sensitive to different departures from the tested distribution.
The complete test statistics of Neyman and of Anderson-Darling combine the
same components, but with different weightings.
Abstract

Higher-order statistics (HOS) are now very widely used. Two areas where they
have begun to receive considerable attention are array and speech processing. This paper
describes some recent applications of HOS in both areas by the authors [18]-[20].

In  our  speech  processing  application,  we  demonstrate  a  way  to  better  discriminate 
between  voiced  and  unvoiced  speech.  This  is  accomplished  by  observing  the  behavior 
of  a  cumulant-based  adaptive  filter,  and  makes  use  of  the  fact  that  unvoiced  speech  is 
Gaussian,  whereas  voiced  speech  is  definitely  non-Gaussian.  We  have  also  shown  a  way 
to  utilize  the  prediction  residual  from  the  adaptive  filter  to  estimate  the  pitch  period 
for  voiced  speech. 

Array processing encompasses a multitude of problems, including beamforming
and direction-of-arrival (DOA) estimation. We have developed fourth-order cumulant-based
blind optimum beamforming algorithms that outperform existing methods. The
term blind indicates that our methods do not require a priori knowledge of array geometry
and DOA, nor are they affected by multipath propagation or the presence of smart
jammers. Extensive simulations support our theoretical claims on the optimality of
our beamforming procedure.


1  Introduction 

Our  work  on  speech  processing  describes  a  method  that  consists  of  an  adaptive  predictor,  a  voicing 
decision  (V/UV),  and  a  pitch  period  estimator.  The  focus  of  this  study  is  on  robust  detection  of 
speech  state  and  estimation  of  pitch  period.  This  is  accomplished  by  observing  the  behavior  of  an 
adaptive predictor which processes the speech signal. Higher-order statistical analysis is proposed
for discrimination of speech states. Comparing the energy of the original speech signal with that
of the prediction-error residual yields the decision method. Both covariance- and cumulant-based
prediction methods are investigated, and the latter is shown to be a more robust way of making the
V/UV decision. Pitch estimation is accomplished by using correlation-based approaches that
operate on the energy estimate of the cumulant-based prediction residual rather than the original
speech signal. Pitch estimation by our method yields better performance than currently existing
batch procedures.

Array processing work, as described in this paper, addresses the problem of blind optimum
beamforming for a non-Gaussian desired signal in the presence of interference. Sensor response
and location uncertainty, together with the use of sample statistics, can severely degrade the
performance of optimum beamformers. In this paper, we propose blind estimation of the source
steering vector in the presence of multiple, directional, correlated or coherent Gaussian interferers
via higher-order statistics. In this way, we employ the statistical characteristics of the desired
signal to make the necessary discrimination, without any a priori knowledge of the array manifold
or direction-of-arrival information about the desired signal. We then improve our method to utilize
the data in a more efficient manner. In any application, only sample statistics are available, so we
propose a robust beamforming approach that employs the steering vector estimate obtained by
cumulant-based signal processing. We further propose a method that employs both covariance and
cumulant information to combat finite-sample effects. We analyze the effects of multipath
propagation on the reception of the desired signal. We show that, even in the presence of coherence,
the cumulant-based beamformer still behaves as the optimum beamformer that maximizes the Signal
to Interference plus Noise Ratio (SINR). Finally, we propose an adaptive version of our algorithm.
Simulations demonstrate the excellent performance of our approach in a wide variety of situations.

2 Cumulant-Based Adaptive Analysis of Speech Signals

The voiced/unvoiced (V/UV) decision is an important problem in speech processing. Almost all speech
coding, recognition and speaker identification systems require this information for effective
processing of speech data. In addition, low-delay speech processing systems require that this decision
be provided in real time. In [2], some commonly employed features are described, and a subset of
them is used to train an artificial neural network to perform the V/UV decision.

In  frame-based  analysis  of  speech  signals,  feature  extraction  is  performed  on  the  current  block 
of  data,  and  a  decision  is  given  at  the  end  of  the  period.  For  this  reason,  frame-based  methods 
are  incapable  of  tracking  rapid  changes  in  signal  characteristics.  Transitions  of  the  state  of  speech 
within  a  frame  period  affect  the  decisions  resulting  from  a  frame-based  analyzer.  In  general,  this 
mixed state of speech within a period cannot be identified and incorrect decisions will be made.
This  will  degrade  the  performance  of  the  overall  speech  processing  system.  In  addition,  frame-based 
analysis  introduces  delay,  which  may  not  be  tolerable  in  low-delay  systems. 

The severe nonstationarity observed in speech signals and the low-delay requirements of contemporary
speech processing systems motivate the use of adaptive algorithms for feature extraction
in place of their batch counterparts. In general, adaptive processing techniques are designed to
minimize some least-squares error criterion. Their use is motivated by the assumption that the
processes are Gaussian and that the performance analysis is tractable under this assumption [3];
however, this approach ignores the non-Gaussian nature of the underlying signal.

Adaptive prediction of the incoming signal and continuous monitoring of prediction-error power
make it possible to detect changes in the spectral characteristics of the process. We may consider
such a change as an event. After an event, an adaptive unit will require a period to adjust itself
to the new configuration. During this learning period, prediction-error power will temporarily
increase. This observation was used in [35] to detect abrupt changes in the autoregressive (AR)
parameters of a linear process. If a lattice form is used rather than a finite impulse response (FIR)
filter, reflection coefficients will be available for monitoring purposes. In addition, adaptive lattice
filters exhibit better learning characteristics than their FIR counterparts. This may improve the
ability to localize the event when prediction-error power is monitored.

Figure 1: Typical speech signals: (a) Unvoiced speech, (b) Voiced speech.

In this study, we shall investigate the application of adaptive prediction methods to detect
V/UV transitions in speech signals; hence, events of interest will be V/UV or UV/V transitions.
Our approach will take the speech production model into account and utilize higher than second-order
statistics of speech signals.


2.1  Speech  Production  Model 

The state of a speech signal falls into one of three categories: voiced, unvoiced and silence. Silent periods
can be detected easily by monitoring the zero-crossing rate and energy of the received signals [53]. For
this reason, we shall concentrate on voiced/unvoiced classification of speech.

Unvoiced sounds are generated by forming a constriction at some point in the vocal tract
and forcing air through the constriction at a high velocity to produce turbulence. This creates
a broad-spectrum noise source to excite the vocal tract. The energy concentration is shifted to
the high-frequency end of the spectrum for unvoiced sounds, but the spectrum is relatively flat
when compared with that of voiced speech. Due to the large number of random effects involved in the
production of unvoiced speech, Gaussian noise is a valid candidate as the excitation source. This
assumption is validated by Wells [73]. In his work, the bispectrum is used to make the V/UV decision.
It has been found that the bispectrum of English fricatives tends to zero, but for vowels the situation is
just the opposite. A typical unvoiced segment of speech is shown in Fig. 1a.

Figure 2: Adjacent sample correlation of speech signals.

Voiced sounds are produced by forcing air through the glottis with the tension of the vocal
cords adjusted so that they vibrate in a relaxation oscillation, thereby producing quasi-periodic
pulses of air which excite the vocal tract. This excitation is clearly non-Gaussian. The energy
concentration is in the low-frequency side of the spectrum in the form of a fundamental component
and its harmonics. In addition, voiced sounds have more energy than unvoiced sounds. A typical
voiced speech segment is shown in Fig. 1b.

For voiced sounds, the vocal tract can be modelled as an all-pole linear system. The same model
also holds for unvoiced sounds, but the AR order is lower. Correlation between adjacent samples is
high for voiced sounds. On the other hand, unvoiced speech resembles white noise since its spectrum
is relatively flat, yielding small correlation between adjacent samples. Correlation sequences for
voiced and unvoiced cases are illustrated in Fig. 2.
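
The contrast in adjacent-sample correlation can be illustrated with synthetic data. The sketch below is only an illustration under stated assumptions: white Gaussian noise stands in for unvoiced speech, and an all-pole (AR(2)) resonator driven by a periodic pulse train stands in for voiced speech. The AR coefficients, pulse period and signal length are hypothetical values, not taken from the paper.

```python
import random

def normalized_autocorr(x, lag):
    """Sample autocorrelation at the given lag, normalized by the lag-0 value."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    cl = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return cl / c0

rng = random.Random(0)

# Stand-in for unvoiced speech: white Gaussian noise (assumption)
unvoiced = [rng.gauss(0.0, 1.0) for _ in range(4000)]

# Stand-in for voiced speech: AR(2) resonator excited by a pulse train (assumption)
voiced = []
y1 = y2 = 0.0
for t in range(4000):
    excitation = 1.0 if t % 80 == 0 else 0.0   # hypothetical pitch period of 80 samples
    y = 1.8 * y1 - 0.9 * y2 + excitation
    voiced.append(y)
    y2, y1 = y1, y

rho_unvoiced = normalized_autocorr(unvoiced, 1)   # close to zero
rho_voiced = normalized_autocorr(voiced, 1)       # close to one
```

With these stand-in signals the lag-1 correlation is near zero for the noise and near one for the resonator output, which is the qualitative behavior described for unvoiced and voiced speech.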

The differences in the excitation and correlation properties for these two cases can be used to
discriminate between them; however, with second-order statistics we can only use the correlation
properties but cannot utilize the information about the excitation model. This motivates the use
of higher-order cumulants of speech signals.

2.2  Our  Approach 

In the previous section, we mentioned the distinctions between voiced and unvoiced sounds: correlation
among adjacent samples and excitation models. In this section, we shall investigate methods
that fully utilize this information.

Linear prediction (LP) methods are employed to accomplish our goal; however, we shall not use
batch-type methods, for reasons outlined previously. Linear prediction can be based on second- or
higher-order statistics; however, the former is usually employed. Linear prediction is essentially
identifying the inverse of a linear system driven by white noise; hence, it can be considered as a
system identification problem. The system under consideration can be approximated by an AR
model, so an FIR prediction filter will whiten the spectrum of the incoming signal. We shall
investigate the differences between cumulant- and covariance-based adaptive prediction methods in
this section.

2.2.1  Second-order  statistics  based  adaptive  filtering 

Correlation-based  adaptive  prediction  filters  tend  to  minimize  the  prediction  error  power  at  the 
output  of  the  filter.  Since  correlation  among  adjacent  samples  is  high  for  voiced  signals,  we  can 
remove  a  large  proportion  of  energy  from  the  original  speech  signal  using  prediction.  On  the  other 
hand,  in  the  case  of  unvoiced  sounds,  LP  will  not  be  that  successful  due  to  small  correlation  among 
samples.  Therefore,  a  comparison  of  the  input  signal  power  with  the  power  in  the  prediction  residual 
will  reveal  the  state  of  the  speech  signal. 
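
The power comparison described above can be sketched with a simple adaptive predictor. The normalized LMS (NLMS) update used here is a stand-in chosen for brevity; the paper does not state that this particular update was used, and the step size, filter order and test signals are assumptions. A strongly correlated AR(2) process plays the role of voiced speech and white noise plays the role of unvoiced speech.

```python
import random

def nlms_prediction_residual(x, order=10, mu=0.5, eps=1e-6):
    """Return the a priori prediction error of an NLMS one-step linear predictor."""
    w = [0.0] * order
    residual = []
    for n in range(len(x)):
        # Most recent `order` past samples (zeros before the start of the data)
        past = [x[n - k] if n - k >= 0 else 0.0 for k in range(1, order + 1)]
        e = x[n] - sum(wi * pi for wi, pi in zip(w, past))
        norm = eps + sum(p * p for p in past)
        w = [wi + mu * e * pi / norm for wi, pi in zip(w, past)]
        residual.append(e)
    return residual

def power(x):
    """Average power (mean square) of a sequence."""
    return sum(v * v for v in x) / len(x)

rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(4000)]       # low adjacent-sample correlation

correlated = []                                          # high adjacent-sample correlation
y1 = y2 = 0.0
for _ in range(4000):
    y = 1.8 * y1 - 0.9 * y2 + rng.gauss(0.0, 1.0)
    correlated.append(y)
    y2, y1 = y1, y

# Ratio of residual power to input power for each stand-in signal
ratio_correlated = power(nlms_prediction_residual(correlated)) / power(correlated)
ratio_white = power(nlms_prediction_residual(white)) / power(white)
```

A ratio well below one indicates that prediction removed most of the signal energy (the voiced-like case), whereas a ratio near or above one indicates an essentially unpredictable input (the unvoiced-like case). Any decision threshold placed between the two would have to be tuned on real data.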

Lattice prediction filters enable monitoring the variation of prediction-error power with model
order due to their specific structure. Autoregressive model-order selection can be performed by
selecting the tap which results in minimum prediction-error power. This leads to another means of
discrimination between voiced and unvoiced sounds, since this order will be relatively lower for the
unvoiced case.
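
The link between lattice stages and prediction-error power at each model order can be illustrated with the Levinson-Durbin recursion. This is a batch computation on a given autocorrelation sequence, not the adaptive lattice filter discussed in the text; it is shown only because it exposes the same quantities (reflection coefficients and error power per order). The example autocorrelation is that of a hypothetical AR(1) process.

```python
def levinson_durbin(r, order):
    """Levinson-Durbin recursion on autocorrelation values r[0..order].

    Returns (a, reflection, errors): the prediction-error filter coefficients
    [1, a1, ..., a_order], the reflection coefficients k_1..k_order, and the
    prediction-error power for model orders 0..order.
    """
    a = [1.0]
    err = r[0]
    errors = [err]
    reflection = []
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        a_ext = a + [0.0]
        a = [a_ext[i] + k * a_ext[m - i] for i in range(m + 1)]
        err *= (1.0 - k * k)
        reflection.append(k)
        errors.append(err)
    return a, reflection, errors

# Exact autocorrelation of x(t) = 0.5 x(t-1) + e(t) with unit-variance e(t)
r = [0.5 ** k / 0.75 for k in range(4)]
a, reflection, errors = levinson_durbin(r, 3)
```

For this AR(1) example the error power stops decreasing after the first stage and the higher-order reflection coefficients vanish, which is the behavior a model-order selection rule based on prediction-error power would exploit.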


2.2.2  Fourth-order  statistics  based  adaptive  filtering 

In this section, we shall investigate the behavior of a fourth-order cumulant-based adaptive filter.
An adaptive algorithm for estimating the parameters of nonstationary AR processes excited by
non-Gaussian signals is proposed in [65], and some modifications are suggested in [22]. We used
the method of [65], which is in the software package Hi-Spec (trademark of United Signals
and Systems, Inc.) [33]. The ideas for the covariance-based filter directly apply to this case with
one important exception: the cumulant-based adaptive filter provides the solution to the cumulant-based
normal equations, and this solution is not the one that minimizes the prediction-error power;
however, one may argue that if the speech production system can be identified accurately, then the
prediction error should be close to the minimum possible value.

With higher-order statistics, we can additionally use the excitation information: for
voiced sounds, the excitation is non-Gaussian; hence, the speech production mechanism can be
identified by cumulant-based AR equations. On the other hand, for unvoiced sounds the excitation
is Gaussian, making the identification problem ill-posed.¹ The cumulant-based adaptive filter will
not be able to identify the system and, since there is no associated output-power minimization
criterion, prediction-error power may arbitrarily increase. In this case, a cumulant-based filter may
even amplify the speech signal, making the comparison of power reduction by prediction clearer
than when using a covariance-based method.
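
The property that drives this argument, namely that fourth-order cumulants vanish for Gaussian data but not for impulsive data, can be checked numerically. The sketch below estimates the zero-lag fourth-order cumulant normalized by the squared variance. The two test signals (Gaussian noise and a sparse pulse train) are illustrative assumptions standing in for unvoiced and voiced excitation.

```python
import random

def normalized_fourth_cumulant(x):
    """Zero-lag fourth-order cumulant estimate, m4 - 3*m2^2, divided by m2^2."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    m2 = sum(v * v for v in centered) / n
    m4 = sum(v ** 4 for v in centered) / n
    return (m4 - 3.0 * m2 * m2) / (m2 * m2)

rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(20000)]          # Gaussian excitation
pulses = [1.0 if t % 20 == 0 else 0.0 for t in range(20000)]    # impulsive excitation

c4_gaussian = normalized_fourth_cumulant(gaussian)   # near zero
c4_pulses = normalized_fourth_cumulant(pulses)       # large and positive
```

With a near-zero cumulant there is essentially no information for cumulant-based normal equations to work with, which is why identification becomes ill-posed for Gaussian excitation.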

To validate our ideas about covariance- and cumulant-based adaptive prediction of speech signals,
we performed some experiments using data from the TIMIT speech recognition database. The
results verify our claims and are provided in the next section.

¹ A cumulant-based filter provides the solution of cumulant-based normal equations in an adaptive fashion;
however, this set of equations becomes trivial when the input to be analyzed is a Gaussian linear process,
because higher than second-order cumulants of Gaussian processes are zero.

2.3 Experiments

We start our experiments by investigating the prediction performance of correlation- and cumulant-based
linear predictors in the voiced speech case. An indication of performance is the energy of the
prediction-error residual at the output of the filter. For this purpose, we selected a voiced speech
segment from the TIMIT database and performed adaptive filtering based on both correlation and
cumulants. We expected that the correlation-based filter would yield better performance, since it is
designed to minimize prediction-error power. The original speech signal is scaled so that the estimate
of its variance is unity. The results of this experiment are shown in Fig. 3. Energy values reported
in this figure represent the estimate of the variance of the signal averaged over the data window.
Interestingly enough, the cumulant-based filter performed better than its covariance counterpart,
although the latter is designed to minimize the power of the prediction residual. We repeated this
experiment with other speech segments, and in all of the cases the cumulant-based filter outperformed
the covariance-based filter.

In voiced speech, a conventional system identification approach for estimating the AR parameters,
using a least-squares fit procedure, suffers due to the nature of the excitation sequence. It is
known that, for voiced speech, the source is definitely non-Gaussian; it is quasi-periodic in nature
with spiky excitations. The impulsive nature of the excitation in voiced speech is exploited in [40]
by making a Bernoulli-Gaussian assumption to develop a multipulse coding scheme. In [39], a
robust linear prediction algorithm is proposed which takes into account the non-Gaussian nature of
source excitation for voiced speech by assuming the excitation is from a mixture distribution, such
that a large portion of the excitation sequence is from a normal distribution with small variance
while a small portion comes from an unknown distribution of higher variance. Such a distribution
is called heavy-tailed Gaussian. Based on the above mixture model, a linear prediction algorithm
is devised which employs robust statistical procedures (developed in [34]) that operate in a batch
mode. Although satisfactory performance is observed, the method cannot track the transitions
in the input data. This points out a very important fact: conventional linear prediction can be
unsatisfactory due to incorrect modelling of the excitation. Of course, this carries over to the
adaptive domain, i.e., a correlation-based adaptive algorithm may not be able to yield the best
possible fit in the presence of outliers in the data. On the other hand, a non-Gaussian excitation
is required by higher-order-statistics-based identification algorithms. A cumulant-based adaptive
filter is able to reduce the power in the signal by effective prediction, although it is not based on a
criterion for minimizing the power of the prediction residual. Power reduction may be even greater than
that provided by a covariance-based filter due to the outlier problem just described.

To analyze the behavior of adaptive predictors in voiced and unvoiced speech states, we selected
a 250 msec speech segment in which there are two transitions: voiced (0-75 msec), unvoiced
(75-190 msec) and again voiced (190-250 msec). This signal is shown in Fig. 4.

We used a tenth-order predictor for adaptive filtering of the speech waveform. Figure 5 shows
the prediction error from a covariance-based filter. Observe that an adaptive filter based on a
power minimization criterion will turn off during the unvoiced period; hence, this segment passes
undistorted through the filter. The reason for this (as explained previously) is the small adjacent-sample
correlation for unvoiced sounds, which makes the process unpredictable. To minimize the
output power, the filter turns off; however, during voiced segments deconvolution is successful. We
observe a quasi-periodic pulse train for the prediction residual, which is in accordance with the
excitation model for voiced speech production.

Figure 3: Energy comparisons. (a) Original speech signal; (b) prediction residual from
covariance-based filter; and (c) prediction residual from cumulant-based filter.

Figure 4: Speech signal to be used in the experiment: (a) first 125 msecs, (b) last 125 msecs.

Figure 5: Prediction residual from covariance-based adaptive filter: (a) first 125 msecs, (b)
last 125 msecs.


Figure 6 depicts the cumulant-based filter residual. During voiced periods, successful deconvolution
is possible since the excitation is non-Gaussian, and again a quasi-periodic pulse train is
observed at the output of the filter. Now, however, the filter amplifies the speech signal during
the unvoiced segment. As explained before, during this mode of operation, the system identification
task is ill-posed, and, since this filter has no power minimization criterion, the power of the
prediction residual becomes higher than that of the unvoiced speech signal.

To make better comparisons concerning the energy of the original speech and the prediction residuals
obtained via the two different filters, we illustrate the energy estimates in Fig. 7. Energy
is estimated by first squaring the signal and then performing low-pass filtering using a 15-point
Hamming window. Fig. 7 shows that, by comparing the prediction-residual power and the original-signal
power, it is possible to make reliable V/UV decisions. With the cumulant-based method,
even better results are obtained, because it amplifies the input data during unvoiced periods.
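
The energy estimate described above (squaring followed by low-pass filtering with a 15-point Hamming window) can be sketched as follows. Three details are assumptions not specified in the text: the window is normalized to unit gain, it is centered on the current sample, and samples outside the data record are treated as zero.

```python
import math

def hamming(n_points):
    """Hamming window coefficients of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n_points - 1))
            for k in range(n_points)]

def energy_estimate(x, n_points=15):
    """Square the signal, then smooth with a unit-gain, centered Hamming window."""
    w = hamming(n_points)
    gain = sum(w)
    squared = [v * v for v in x]
    half = n_points // 2
    out = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(n_points):
            idx = n + k - half
            if 0 <= idx < len(squared):      # zero padding at the edges (assumption)
                acc += w[k] * squared[idx]
        out.append(acc / gain)
    return out

# Hypothetical usage on a constant-amplitude test signal
energy = energy_estimate([2.0] * 50)
```

Applying the same routine to the original signal and to each prediction residual gives sequences that can be compared sample by sample when making the V/UV decision.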

The observations from this experiment validate our earlier statements; however, using a predictor
may bring additional advantages as well. One important by-product is pitch period estimation.
Pitch period is the time difference between the quasi-periodic excitation pulses during voiced speech.
After the V/UV detection step, better pitch estimation is possible by operating on the energy estimate
of the prediction residual rather than on the original speech signal. From Fig. 7, we observe that
the peaks in the energy estimate sequence are spaced by a pitch period during voiced periods and
they are sharper than the ones in the original speech signal due to the combined filtering and squaring
operations. Consequently, we may apply the correlation-based approach described in [18] to the
energy estimate sequence, for a reliable, simple but robust calculation of pitch period. In [18], pitch
estimation is accomplished as follows: the low-pass filtered speech signal is quantized to three levels
(-1, 0, 1), and the correlation sequence of this quantized signal is obtained. Covariance calculation is
simple with the quantized sequence, since it can be performed by addition only. Finally, a peak-picking
method estimates the pitch period. Peak-search is performed on the possible range of values

Figure 7: Energy estimates. (a) Original speech signal; (b) energy estimate of original
speech signal; (c) energy estimate of prediction-error residual from covariance-based filter;
(d) energy estimate of prediction-error residual from cumulant-based filter.
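
The three-level quantization and correlation peak-picking procedure summarized above can be sketched as follows. The clipping threshold (a fraction of the peak magnitude) and the lag search range are assumptions introduced for illustration; the text does not give the values used in [18].

```python
def three_level_quantize(x, threshold):
    """Map each sample to -1, 0 or 1 according to a symmetric threshold."""
    return [1 if v > threshold else (-1 if v < -threshold else 0) for v in x]

def pitch_period(x, min_lag, max_lag, clip_fraction=0.3):
    """Return the lag in [min_lag, max_lag] maximizing the quantized correlation."""
    peak = max(abs(v) for v in x)
    q = three_level_quantize(x, clip_fraction * peak)   # clip_fraction is hypothetical
    best_lag, best_val = min_lag, None
    for lag in range(min_lag, max_lag + 1):
        # Products of values in {-1, 0, 1} need only sign logic and addition
        val = sum(q[t] * q[t + lag] for t in range(len(q) - lag))
        if best_val is None or val > best_val:
            best_lag, best_val = lag, val
    return best_lag

# Synthetic energy-like sequence with a sharp peak every 50 samples
sequence = [1.0 if t % 50 == 0 else 0.05 for t in range(400)]
estimated_period = pitch_period(sequence, min_lag=20, max_lag=120)
```

In this synthetic case the search returns the pulse spacing. On real data the lag range would be set from the plausible range of pitch periods at the sampling rate in use.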


2.4  Conclusions 

In this work, we showed that it is possible to track transitions in the state of speech using adaptive
linear prediction. Both covariance- and cumulant-based methods are investigated, and greater
contrast between V/UV cases is demonstrated by the latter method because cumulants can use
the difference in the excitation model of the two speech states.

Pitch-period estimation is also possible by linear prediction. Rather than operating on the
original signal, we prefer to employ the prediction-error residual available from an adaptive filter.
A cumulant-based approach operating on the power estimate of the residual process is shown to be
a practical way of pitch estimation.

We investigated the prediction performance of adaptive predictors based on correlation and
cumulants and found that cumulant-based prediction can outperform correlation-based prediction,
although the latter is designed to minimize the power of the prediction residual. We conjectured
that outliers in the excitation model of voiced speech result in this phenomenon. The better prediction
performance obtained via cumulants is worth investigating analytically; however, this is not
tractable with real or synthesized speech since there are many parameters involved. Simpler cases,
such as a single sinusoid in Gaussian noise, can be analyzed to evaluate the performance of cumulant-
and covariance-based adaptive line enhancers.

Figure 8: Pitch-period estimation experiment. (a) Original speech signal; (b) energy estimate
of prediction-error residual from cumulant-based filter; (c) pitch contour obtained
by processing energy-estimate sequence using the method in [18]; and (d) autocorrelation
sequence of the second voiced speech segment processed by the method in [37], leading to a
gross error.

3  Cumulant-Based  Blind  Optimum  Beamforming 

Array processing techniques play an important role in enhancement of signals in the presence of
interference. A number of books and an extensive literature [13,30-32,42,44,50,64,68] have already
been published. Capon's minimum-variance distortionless response (MVDR) beamformer [8] has
been a starting point for both signal enhancement and high-resolution direction-of-arrival (DOA)
estimation.

In recent years, there has been an increasing interest in high-resolution array processing
techniques based on eigendecomposition of the covariance matrix of received signals
[4,17,26-27,36,38,56,60-61,69,71]. To recover the signal of interest in the presence of interfering signals, the
so-called COPY function [58] is used. In this procedure, DOAs for all signals are first estimated,
and then the minimum-variance processor that reconstructs the desired signal and minimizes the
contribution of all interference sources is implemented. All of the previously referenced methods
rely on complete knowledge of responses and locations of array elements and/or DOA information
of the desired signal.

If the array manifold is unknown, or there are uncertainties, it is then necessary to calibrate
the array [55,72]; however, this is not a practical thing to do, since calibration must be done quite
frequently, and, each time, array-manifold information must be stored. In addition, calibration
sources may be required. Even small errors in the calibration procedure may considerably degrade
the performance. Sensitivity analyses of high-resolution methods and MVDR beamforming have
been presented in [11,12,14,16,24-25,29,70,76].

In this study, we shall employ higher-order statistics of received signals to estimate the steering
vector of the non-Gaussian desired signal in the presence of directional Gaussian interferers with
unknown covariance structure. We assume no knowledge of the array manifold or of DOA information
about the desired signal. The desired signal may be voiced speech, a sonar signal, a radar return or a
communication signal. In our work, we specialize to the communications scenario, which requires the
use of fourth-order cumulants. Following a mathematical formulation of the problem in Section 3.1,
we describe blind estimation and optimum beamforming procedures in Section 3.2.

Any estimation procedure is subject to errors, as is our cumulant-based source steering vector
estimation method. In theory, cumulants are blind to Gaussian noise; however, their estimates are
corrupted by such noise. In order to obtain satisfactory results, longer data lengths are necessary in
cumulant-based signal processing. To alleviate the effects of estimation error in the beamforming
step, we propose a more efficient estimation procedure that fully utilizes the data acquired by the
array. We further suggest a method of combining cumulant and covariance information to yield
better estimates. Then we employ a robust beamforming method based on artificial noise injection
to combat mismatch in the source steering vector. We consider the estimation error as a mismatch
and successfully apply this robust approach to our problem. These methods are presented in
Section 3.3.

In a communications environment, multipath propagation almost always takes place. In this case,
all eigendecomposition-based techniques and MVDR fail. Only in some specific array configurations
is it possible to decorrelate incoming signals and then estimate their DOAs. We analyze the
behavior of our cumulant-based approach in Section 3.4. We show that our proposed approach
behaves as the optimum beamformer that maximizes the Signal to Interference plus Noise Ratio
(SINR).

For real-time operation (a necessary requirement in communications applications), we propose
an adaptive implementation of the cumulant-based beamformer in Section 3.5. We then present
simulation experiments to indicate the performance of our approach in Section 3.6. Finally, we
draw our conclusions in Section 3.7.

3.1  Problem  Formulation 

We formulate our problem in a narrowband fashion. In array processing, a problem is classified as
narrowband if the signal bandwidth is small compared to the reciprocal of the time required for
the signal wavefront to propagate across the array. For a discussion on bandwidth, see [60,63].

In our formulation, lower and upper case italic letters are used to represent scalars, lower case
bold font letters are used for vectors, and upper case bold font letters are used for matrices.

3.1.1  Signal  Model 

Consider an array of $M$ elements, with arbitrary sensor response characteristics and locations.
Assume there are $J$ Gaussian interference signals $\{ i_j(t),\ j = 1, 2, \ldots, J \}$, and a non-Gaussian
desired signal $d(t)$, centered at frequency $\omega_0$. We assume sources are far away from the array so
that a planar wavefront approximation is possible. The additive noise present is assumed to be
Gaussian with unknown covariance. With these assumptions, the received signal at the $k$th sensor
can be expressed as

$$r_k(t) = a_k(\theta_d)\, d(t) + \sum_{j=1}^{J} a_k(\theta_{i_j})\, i_j(t) + n_k(t) \tag{1}$$

where

• $\theta_x$ : the direction-of-arrival of the wavefront corresponding to emitter $x$.

• $a_k(\theta_x)$ : response of the $k$th sensor to the $x$th signal wavefront, including the phase factor
associated with the travel time of the signal wavefront with respect to a reference point; without
loss of generality, this point can be taken as the first sensor location.

• $d(t)$ : the desired non-Gaussian signal as received at sensor 1, with variance $\sigma_d^2$.

• $i_j(t)$ : the $j$th interferer waveform as received at sensor 1; interference signals are assumed
to be independent of the desired signal, and they are Gaussian processes.

• $n_k(t)$ : the additive noise at the $k$th sensor.

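Equation (1) can be rewritten in matrix notation, as

$$ \mathbf{r}(t) = \mathbf{a}(\theta_d)\, d(t) + \sum_{j=1}^{J} \mathbf{a}(\theta_{i_j})\, i_j(t) + \mathbf{n}(t) \tag{2} $$

where $\mathbf{a}(\theta_x)$ represents the $M \times 1$ steering vector for the wavefront from emitter $x$, which can be
expressed as

$$ \mathbf{a}(\theta_x) = [\, a_1(\theta_x),\ a_2(\theta_x),\ \ldots,\ a_M(\theta_x) \,]^T \tag{3} $$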

We define the array manifold as the collection of steering vectors over all DOA's of interest. Alternative
expressions for the received signal vector are

$$ \mathbf{r}(t) = \mathbf{A}\, \mathbf{z}(t) + \mathbf{n}(t) = \mathbf{a}(\theta_d)\, d(t) + \mathbf{A}_I\, \mathbf{i}(t) + \mathbf{n}(t) \tag{4} $$

In this last expression, we partitioned the $M \times (J+1)$ steering matrix $\mathbf{A}$ as

$$ \mathbf{A} = [\, \mathbf{a}(\theta_d),\ \mathbf{A}_I \,] \tag{5} $$

where the $M \times J$ matrix $\mathbf{A}_I$ is the steering matrix for interference sources.
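
To make the model in (1)-(5) concrete, the following sketch synthesizes array snapshots. It is an illustration under stated assumptions only: the uniform linear array geometry, the helper names `ula_steering` and `simulate_snapshots`, the BPSK desired signal, and all numerical values are hypothetical choices, whereas the formulation above allows arbitrary sensor responses and locations.

```python
import numpy as np

def ula_steering(theta_deg, m, spacing=0.5):
    """Steering vector of an m-element uniform linear array (assumed geometry).

    theta_deg is measured from broadside; spacing is in wavelengths.
    The first sensor is the phase reference, so the first entry equals 1.
    """
    k = np.arange(m)
    return np.exp(2j * np.pi * spacing * k * np.sin(np.deg2rad(theta_deg)))

def complex_gaussian(rng, shape, sigma):
    """Circular complex Gaussian samples with standard deviation sigma."""
    return sigma * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

def simulate_snapshots(n, m, theta_d, theta_i, sigma_i=1.0, sigma_n=0.1, seed=0):
    """Return (r, a_d, A_I) with r = a_d d(t) + A_I i(t) + n(t), r of shape (m, n)."""
    rng = np.random.default_rng(seed)
    a_d = ula_steering(theta_d, m)
    A_I = np.column_stack([ula_steering(th, m) for th in theta_i])
    d = rng.choice([-1.0, 1.0], size=n)                  # BPSK: a non-Gaussian desired signal
    i = complex_gaussian(rng, (A_I.shape[1], n), sigma_i)  # Gaussian interferers
    noise = complex_gaussian(rng, (m, n), sigma_n)         # additive sensor noise
    r = np.outer(a_d, d) + A_I @ i + noise
    return r, a_d, A_I

r, a_d, A_I = simulate_snapshots(n=2000, m=6, theta_d=5.0, theta_i=[-30.0, 40.0])
A = np.column_stack([a_d, A_I])   # partitioned steering matrix, as in (5)
```

Here `r` holds one snapshot per column, and `A` has $J + 1 = 3$ columns, the first being the desired-signal steering vector.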

In this paper, we address the problem of optimum beamforming with an array of sensors whose
responses and locations are completely unknown; hence, although we may have a priori knowledge
about the direction-of-arrival of the desired signal, we cannot perform beamforming due to the lack of
knowledge of the array manifold. In [23], this problem is addressed; however, the algorithm of [23] is limited
to a single interference signal. We investigate the possibility of a more general solution: namely,
signal recovery in the presence of multiple interferers whose correlation structure is unknown. Before
presenting our approach, which employs higher-order statistics, we demonstrate the limitations of
covariance-based array processing for this problem.

3.1.2  Covariance-Based  Approaches

Currently used high-resolution methods of DOA estimation and minimum-variance distortionless
response (MVDR) beamforming employ the covariance matrix of signals received by the array. The
wavefront covariance matrix, $\mathbf{S}$, is defined as the covariance of the source signals as received at the
reference point, i.e., at sensor 1:

$$ \mathbf{S} = E\{ \mathbf{z}(t)\, \mathbf{z}^H(t) \} \tag{6} $$

where $(\cdot)^H$ denotes complex conjugate transpose. Using the received signal model in (4), we can
express the $M \times M$ covariance matrix $\mathbf{R}$ of array measurements in the following two ways:

$$ \mathbf{R} = E\{ \mathbf{r}(t)\, \mathbf{r}^H(t) \} = \mathbf{A} \mathbf{S} \mathbf{A}^H + \mathbf{R}_n = \sigma_d^2\, \mathbf{a}(\theta_d)\, \mathbf{a}^H(\theta_d) + \mathbf{R}_u \tag{7} $$

where $\mathbf{R}_n$ is the noise covariance matrix,

$$ \mathbf{R}_n = E\{ \mathbf{n}(t)\, \mathbf{n}^H(t) \} \tag{8} $$

and $\mathbf{R}_u$ is the covariance matrix of the undesired signals, i.e.,

$$ \mathbf{R}_u = E\{ [\, \mathbf{A}_I \mathbf{i}(t) + \mathbf{n}(t) \,] [\, \mathbf{A}_I \mathbf{i}(t) + \mathbf{n}(t) \,]^H \} \tag{9} $$

In general, the noise covariance matrix, $\mathbf{R}_n$, is unknown. With some restrictions on array
orientation and noise covariance structure, some approaches for high-resolution DOA estimation are
proposed in [47,52] that do not require this information; however, these techniques have their
limitations due to the assumptions involved. Even with complete knowledge of the noise covariance structure,
source localization is still impossible without knowledge of the array manifold. In [56], the ESPRIT
algorithm is devised to overcome this problem; however, ESPRIT requires translationally equivalent
subarrays with known displacement vectors, which may also be impractical due to all the
constraints on array orientation. In [21], an eigendecomposition-based beamforming approach is
proposed which assumes the identifiability of the signal subspace and availability of the steering
vector information for the signal of interest. Good results were obtained under these assumptions;
however, this method cannot handle coherent interference and spatially colored noise.

In [9-10,57], blind estimation of steering vectors for independent emitters is discussed with the
following conclusion:

Blind estimation of source steering vectors is not possible with only second-order
statistics, but employing higher-than-second-order cumulants, it is possible to estimate
source steering vectors up to a scale factor.

MVDR beamforming is an alternate approach for signal recovery. This approach, however,
requires knowledge of the steering vector for the desired source up to a scale factor and uses the
covariance matrix $\mathbf{R}$ of received signals for processing. The output of the MVDR beamformer, $y(t)$,
can be expressed as [8]

$$ y(t) = \mathbf{w}^H \mathbf{r}(t) = [\, \beta_1 \mathbf{R}^{-1} \mathbf{a}(\theta_d) \,]^H\, \mathbf{r}(t) \tag{10} $$

where the constant $\beta_1$ is present to maintain a specified response for the desired signal and $\mathbf{w}$
denotes the weight vector of the processor.
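
The weight computation in (10) can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it assumes the common unit-gain choice $\beta_1 = 1 / (\mathbf{a}^H \mathbf{R}^{-1} \mathbf{a})$, a hypothetical 4-element half-wavelength linear array, and a known covariance matrix; the function name `mvdr_weights` is introduced here for illustration.

```python
import numpy as np

def mvdr_weights(R, a):
    """Return w = beta * R^{-1} a with beta = 1 / (a^H R^{-1} a), so that w^H a = 1."""
    Rinv_a = np.linalg.solve(R, a)
    return Rinv_a / np.vdot(a, Rinv_a)   # np.vdot conjugates its first argument

# Assumed scenario: desired signal at broadside with unit power, one interferer
# at 20 degrees with power 10, spatially white noise with power 0.1.
m = 4
k = np.arange(m)
a_d = np.ones(m, dtype=complex)
a_i = np.exp(1j * np.pi * k * np.sin(np.deg2rad(20.0)))
R = np.outer(a_d, a_d.conj()) + 10.0 * np.outer(a_i, a_i.conj()) + 0.1 * np.eye(m)

w = mvdr_weights(R, a_d)
gain_desired = abs(np.vdot(w, a_d))      # |w^H a(theta_d)|, held at 1 by the constraint
gain_interferer = abs(np.vdot(w, a_i))   # |w^H a(theta_i)|, driven toward 0
```

Note that the computation needs the steering vector `a_d` explicitly, which is the requirement the remainder of this section is concerned with.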

From the above expression, it is clear that MVDR beamforming requires knowledge of $\mathbf{a}(\theta_d)$.
Without knowledge of the array manifold, it is not possible to determine $\mathbf{a}(\theta_d)$ even in the case of known
$\theta_d$; therefore, MVDR beamforming cannot be directly applied to our problem. In addition, the
MVDR beamformer is quite sensitive to errors in assumed sensor locations and characteristics
[11-12,14,29,70,76].

In many applications, multipath propagation takes place, resulting in coherent sources.
Coherence presents a serious problem to DOA methods: it leads to a singular source covariance matrix
$\mathbf{S}$, for which it is not possible to estimate source locations except in some specific array
configurations [48-49,61-62,66,74,75]. In the MVDR case, source coherency does not represent a problem as
long as there is no source correlated with the desired signal; however, this situation is rarely met
in practice. In general, the desired signal is subject to multipath propagation, and the performance
of the MVDR approach degrades severely [54,78]. An optimum beamforming procedure has been
suggested in [6] to overcome the coherence problem by using a linear array of elements with identical
directional characteristics.

We are therefore looking for a method that can overcome all these problems. In the next section,
we present an approach that accomplishes this by combining cumulant-based blind estimation and
MVDR beamforming.

3.2  Cumulant-Based  Optimum  Beamforming 

In the previous section, we discussed the problem of optimum beamforming and concluded that it
is not possible to recover a desired signal in the presence of multiple interferers, unknown sensor
noise covariance, multipath propagation and without any information about the array manifold. In
this section, we propose a method to overcome these problems. We propose a two-step procedure:
higher-order statistics for blind estimation of the source steering vector, followed by MVDR
beamforming based on second-order statistics of received signals and the steering vector estimate provided
by the first step.

3.2.1  Estimation  of  desired  signal  steering  vector 

In this section, we employ cumulants of received signals to estimate the steering vector of the
desired signal up to a constant factor. Third-order cumulants are blind to signals with symmetric
probability density functions. On the other hand, most signals in communication environments have
symmetric density functions, which motivates the use of fourth-order cumulants (an estimation
procedure based on third-order statistics is presented in [19]). First, we define
the fourth-order zero-lag cumulant operator of complex processes $\{ x_1(t), x_2(t), x_3(t), x_4(t) \}$ as

$$ \mathrm{cum}\{ x_1(t), x_2(t), x_3(t), x_4(t) \} \triangleq E\{ x_1(t) x_2(t) x_3(t) x_4(t) \} - E\{ x_1(t) x_2(t) \} E\{ x_3(t) x_4(t) \} - E\{ x_1(t) x_3(t) \} E\{ x_2(t) x_4(t) \} - E\{ x_1(t) x_4(t) \} E\{ x_2(t) x_3(t) \} \tag{11} $$

Next, consider the vector $\mathbf{c} = [\, c_1, c_2, \ldots, c_M \,]^T$, defined as

$$ c_k = \mathrm{cum}\{ r_1(t),\ r_1^*(t),\ r_1^*(t),\ r_k(t) \}, \quad k = 1, 2, \ldots, M. \tag{12} $$

As suggested in [Id], there are various ways of defining fourth-order statistics of complex random
processes. We follow the approach presented in [bl] in (12). Since interference signals are independent
of the desired signal and they are Gaussian with zero fourth-order cumulants, we can express
$c_k$ as

$$ c_k = \mathrm{cum}\{ a_1(\theta_d) d(t),\ a_1^*(\theta_d) d^*(t),\ a_1^*(\theta_d) d^*(t),\ a_k(\theta_d) d(t) \} \tag{13} $$

Using properties of cumulants, we obtain

$$ c_k = |a_1(\theta_d)|^2\, a_1^*(\theta_d)\, \gamma_{4,d}\, a_k(\theta_d) \tag{14} $$

where $\gamma_{4,d}$ denotes the zero lag of the fourth-order cumulant of the desired signal. Defining
$\beta_2 \triangleq |a_1(\theta_d)|^2\, a_1^*(\theta_d)\, \gamma_{4,d}$, we have the following expression for the $M \times 1$ vector $\mathbf{c}$:

$$ \mathbf{c} = \beta_2\, \mathbf{a}(\theta_d) \tag{15} $$

Observe that the vector $\mathbf{c}$ is a replica of the steering vector of the desired signal up to a scale factor.
We show in the next section how this information can be used to recover the desired signal.
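
The estimate in (12) can be tried on synthetic data. The sketch below replaces the expectations in (11) with sample averages; the function names `cum4` and `cumulant_steering_estimate`, the half-wavelength linear array, the BPSK desired signal, and the power levels are all assumptions made for illustration. For unit-power BPSK, $\gamma_{4,d} = -2$, so with the first sensor as phase reference the estimate should be close to $-2\, \mathbf{a}(\theta_d)$.

```python
import numpy as np

def cum4(x1, x2, x3, x4):
    """Sample version of (11) for zero-mean sequences of equal length."""
    def m2(u, v):
        return np.mean(u * v)
    return (np.mean(x1 * x2 * x3 * x4)
            - m2(x1, x2) * m2(x3, x4)
            - m2(x1, x3) * m2(x2, x4)
            - m2(x1, x4) * m2(x2, x3))

def cumulant_steering_estimate(r):
    """Sample version of (12): c_k = cum{r_1, r_1^*, r_1^*, r_k}; r has shape (M, N)."""
    r1 = r[0]
    return np.array([cum4(r1, r1.conj(), r1.conj(), rk) for rk in r])

def steer(deg, m):
    """Assumed half-wavelength uniform linear array, first sensor as reference."""
    return np.exp(1j * np.pi * np.arange(m) * np.sin(np.deg2rad(deg)))

def cgauss(rng, shape, sigma):
    """Circular complex Gaussian samples with standard deviation sigma."""
    return sigma * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

rng = np.random.default_rng(1)
m, n = 6, 20000
a_d = steer(5.0, m)
A_I = np.column_stack([steer(-30.0, m), steer(40.0, m)])
d = rng.choice([-1.0, 1.0], size=n)                      # BPSK desired signal
r = np.outer(a_d, d) + A_I @ cgauss(rng, (2, n), 0.5) + cgauss(rng, (m, n), 0.1)

c = cumulant_steering_estimate(r)
# Cosine of the angle between c and the true steering vector (1 means parallel).
alignment = abs(np.vdot(c, a_d)) / (np.linalg.norm(c) * np.linalg.norm(a_d))
```

The Gaussian interferers and noise contribute only estimation noise to `c`, which is why the array manifold is never consulted.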

3.2.2  Interference  Rejection 

With the knowledge of the steering vector of the desired signal, interference rejection is possible
using the following minimum-variance distortionless response formulation: find the weight vector
$\mathbf{w}$ that minimizes the power, $\mathbf{w}^H \mathbf{R} \mathbf{w}$, at the output of the beamformer subject to the constraint
$\mathbf{w}^H \mathbf{c} = 1$, where $\mathbf{c}$ is obtained via the cumulant-based estimation procedure described in
Subsection 3.2.1. The solution to this optimization problem is well-known [8], and can be expressed
as

$$ \mathbf{w} = \beta_3\, \mathbf{R}^{-1} \mathbf{c} \tag{16} $$

where the constant $\beta_3 = ( \mathbf{c}^H \mathbf{R}^{-1} \mathbf{c} )^{-1}$ is present in order to maintain the linear constraint.

Due to the constraint $\mathbf{w}^H \mathbf{c} = 1$, the power minimization procedure does not cancel the desired
signal, but rejects all interference components and sensor noise in the best possible manner. Note
that this is accomplished without knowledge of the covariance structure of interference signals, sensor
noise or array manifold. In the sequel, we refer to the processor in (16) as CUM1. The proof
that this cumulant-based beamformer is identical to the maximum SINR processor is provided in
Section 3.4, where the general multipath case is treated.
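
A small numerical check illustrates why knowing the steering vector only up to a scale factor is enough for (16). The sketch uses exact (not sample) covariances and a hypothetical 5-element half-wavelength linear array; the function names and the scale factor are illustrative assumptions.

```python
import numpy as np

def cum_beamformer_weights(R, c):
    """Sketch of (16): w = (c^H R^{-1} c)^{-1} R^{-1} c, which satisfies w^H c = 1."""
    Rinv_c = np.linalg.solve(R, c)
    return Rinv_c / np.vdot(c, Rinv_c)

def sinr(w, a_d, R_u, sigma_d2=1.0):
    """Output SINR: sigma_d^2 |w^H a_d|^2 / (w^H R_u w)."""
    return sigma_d2 * abs(np.vdot(w, a_d)) ** 2 / np.vdot(w, R_u @ w).real

m = 5
k = np.arange(m)
a_d = np.exp(1j * np.pi * k * np.sin(np.deg2rad(10.0)))
a_i = np.exp(1j * np.pi * k * np.sin(np.deg2rad(-25.0)))
R_u = 4.0 * np.outer(a_i, a_i.conj()) + 0.2 * np.eye(m)   # interference plus noise
R = np.outer(a_d, a_d.conj()) + R_u                        # unit-power desired signal

scale = -2.0 + 0.7j                      # arbitrary complex scale factor (assumed)
w_exact = cum_beamformer_weights(R, a_d)
w_scaled = cum_beamformer_weights(R, scale * a_d)
constraint = np.vdot(w_scaled, scale * a_d)   # w^H c, should equal 1
sinr_exact = sinr(w_exact, a_d, R_u)
sinr_scaled = sinr(w_scaled, a_d, R_u)
```

The two weight vectors differ only by a complex constant, so the output SINR is the same for both.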

3.3  Robust  Beamforming 

In this section, we first propose an approach that utilizes the received data in the estimation of
the source steering vector in a more efficient manner. We then suggest a method that uses both
cumulants and covariance information under some scenarios. Finally, we employ a robust method
to combat the effects of estimation errors.

3.3.1  Efficient  Utilization  of  Array  Data

In the previous section, we presented a method of blind estimation of the desired source steering
vector from the received data; however, the proposed approach is rather inefficient, in the sense that
only the first sensor is taken as reference. For example, if the connection from this element to the
processor is broken, then the estimation objective cannot be accomplished. Similarly, due to poor
receiving circuitry following this array element, the reference signal may be very noisy, degrading
the quality of the estimate. We can overcome these difficulties by using multiple reference elements.
Define the matrix $\mathbf{C}$ with the $(k,l)$th element,

$$ C_{k,l} = \mathrm{cum}\{ r_k(t),\ r_l^*(t),\ r_l^*(t),\ r_l(t) \} \quad \text{where } k, l = 1, \ldots, M. \tag{17} $$

With true statistics, the cross-cumulant matrix $\mathbf{C}$ will have rank 1, since all its columns are scaled
replicas of the desired source steering vector; however, with sample statistics this condition never
holds. The left singular vector of $\mathbf{C}$ with the largest singular value can be used as the estimate of
the desired source steering vector, removing the effects of noise. In this way, we utilize array data
more efficiently*. The beamformer that employs the steering vector estimate obtained in the way
described above is referred to as the CUM2 beamformer in the sequel.
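
The CUM2 estimate can be sketched as follows, with sample averages in place of the expectations in (17) and NumPy's `np.linalg.svd` supplying the principal left singular vector. The function names, the array geometry, the BPSK source, and the power levels are hypothetical choices for illustration.

```python
import numpy as np

def cross_cumulant_matrix(r):
    """Sample version of (17): C[k, l] = cum{r_k, r_l^*, r_l^*, r_l}; r has shape (M, N)."""
    m = r.shape[0]
    C = np.empty((m, m), dtype=complex)
    for l in range(m):
        rl = r[l]
        fourth = (r * (np.abs(rl) ** 2 * rl.conj())).mean(axis=1)
        second_a = (r * rl.conj()).mean(axis=1) * np.mean(np.abs(rl) ** 2)
        second_b = (r * rl).mean(axis=1) * np.conj(np.mean(rl ** 2))
        # Two of the three product terms in (11) coincide for this argument pattern.
        C[:, l] = fourth - 2.0 * second_a - second_b
    return C

def cum2_steering_estimate(r):
    """Principal left singular vector of C and the singular values."""
    U, s, _ = np.linalg.svd(cross_cumulant_matrix(r))
    return U[:, 0], s

def steer(deg, m):
    """Assumed half-wavelength uniform linear array, first sensor as reference."""
    return np.exp(1j * np.pi * np.arange(m) * np.sin(np.deg2rad(deg)))

def cgauss(rng, shape, sigma):
    return sigma * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

rng = np.random.default_rng(4)
m, n = 6, 20000
a_d = steer(5.0, m)
A_I = np.column_stack([steer(-30.0, m), steer(40.0, m)])
d = rng.choice([-1.0, 1.0], size=n)
r = np.outer(a_d, d) + A_I @ cgauss(rng, (2, n), 0.5) + cgauss(rng, (m, n), 0.1)

u, s = cum2_steering_estimate(r)
alignment = abs(np.vdot(u, a_d)) / np.linalg.norm(a_d)   # u has unit norm
```

Every column of `C` uses a different sensor as reference, so a single faulty element no longer determines the quality of the estimate.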

In addition, the Total Least Squares algorithm, which takes the errors in both the received data
covariance matrix estimate and the steering vector estimate into account, is a better choice for
computing the optimum weight vector, as suggested in [78], but it is computationally expensive.
If extra computations are feasible, we suggest the use of the Constrained Total Least Squares
algorithm [1] for even better numerical results.

3.3.2  Covariance-Cumulant  (C2)  Approach 

In some array processing applications, the sensor noise covariance has a definite structure,
enabling a whitening operation on the received data. The principal eigenvectors of the covariance
matrix of this processed data reveal the subspace spanned by the steering vectors of directional
signals illuminating the array [r».x]. Hence, the steering vector estimate obtained by the
cumulant-based approach can be improved by projecting this estimate on the subspace spanned by the
principal eigenvectors of the covariance matrix. This improved estimate can then be used in the
beamforming procedure of Section 3.2.2. The motivation behind this approach is that covariance
estimates exhibit less variance than cumulant estimates, but in the covariance domain we cannot
identify the source steering vector if there are multiple sources. This procedure yields an estimate of
the steering vector from covariance-matrix information by employing the cumulant-based estimate
as side information. A mathematical description of this approach is presented below:

1. From the received data, estimate the covariance matrix $\mathbf{R}$ and the desired signal steering
vector $\mathbf{c}$ by the cumulant-based procedure.

2. Perform an eigendecomposition of the sample covariance matrix to reveal the signal and
noise subspaces: the eigenvectors of $\mathbf{R}$ with the repeated minimum eigenvalue span the noise
subspace [josj, while the rest span the signal subspace.

3. Assume the signal subspace is $(J+1)$ dimensional. Then, the basis vectors for the signal
subspace, obtained from the eigendecomposition procedure, can be stored in an $M \times (J+1)$
matrix $\mathbf{E}_s$, with the column space identical to the signal subspace.

4. Project the cumulant-based steering vector estimate $\mathbf{c}$ on the signal subspace to obtain an
improved estimate $\mathbf{c}_{\mathrm{proj}}$, as

$$ \mathbf{c}_{\mathrm{proj}} = \mathbf{E}_s \mathbf{E}_s^H \mathbf{c} $$

5. Compute the weights for the beamformer, as

$$ \mathbf{w} = \beta\, \mathbf{R}^{-1} \mathbf{c}_{\mathrm{proj}} $$

with the constant $\beta$ chosen as in (16) to maintain the linear constraint.
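
The five steps above can be sketched as follows. This is only an illustration: it uses an exact covariance matrix with white noise (so no whitening step is needed), a hypothetical 8-element half-wavelength linear array, and a synthetic perturbation standing in for cumulant estimation error; `c2_weights` and `alignment` are names introduced here.

```python
import numpy as np

def c2_weights(R, c, num_sources):
    """Project c on the signal subspace of R, then form the constrained weights."""
    _, eigvecs = np.linalg.eigh(R)            # eigenvalues returned in ascending order
    E_s = eigvecs[:, -num_sources:]           # eigenvectors of the largest eigenvalues
    c_proj = E_s @ (E_s.conj().T @ c)         # step 4: c_proj = E_s E_s^H c
    Rinv_c = np.linalg.solve(R, c_proj)
    w = Rinv_c / np.vdot(c_proj, Rinv_c)      # step 5, normalized so that w^H c_proj = 1
    return c_proj, w

def alignment(x, a):
    """Cosine of the angle between vectors x and a."""
    return abs(np.vdot(x, a)) / (np.linalg.norm(x) * np.linalg.norm(a))

def steer(deg, m):
    return np.exp(1j * np.pi * np.arange(m) * np.sin(np.deg2rad(deg)))

rng = np.random.default_rng(2)
m = 8
a_d = steer(5.0, m)
A_I = np.column_stack([steer(-30.0, m), steer(40.0, m)])
R = np.outer(a_d, a_d.conj()) + A_I @ A_I.conj().T + 0.1 * np.eye(m)

# Perturbed steering vector estimate: scaled replica plus synthetic estimation error.
c = -2.0 * a_d + 0.5 * (rng.standard_normal(m) + 1j * rng.standard_normal(m))
c_proj, w = c2_weights(R, c, num_sources=3)   # J + 1 = 3 directional signals

before = alignment(c, a_d)
after = alignment(c_proj, a_d)
```

The projection removes the part of the error that lies in the noise subspace, so `after` is never smaller than `before` when the signal subspace is known exactly.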

* A method that utilizes the array data even more efficiently is presented in [19].

3.3.3  Robustness  Constraint


Any estimation procedure is inevitably subject to errors. MVDR beamforming is extremely sensitive
to mismatch [11-12,14,16,29,70,76], especially in high SNR conditions and in arrays with a large
number of elements. A variety of constraints have been summarized in [68] assuming perfect
knowledge of element characteristics and locations; however, in our case these methods are not
applicable since there is no available information about the array manifold to design effective
constraints.

Errors  in  the  steering  vector  estimate  result  in  signal  cancellation.  This  mismatch  condition, 
arising  from  non-perfect  estimation,  can  be  viewed  as  the  problem  of  optimum  beamforming  with 
an  array  of  sensors  at  slightly  perturbed  locations.  In  [15],  a  method  that  constrains  the  white 
noise  gain  of  the  processor  is  proposed  for  the  solution  of  the  latter  problem.  In  this  section,  we 
use the same approach to alleviate the effects of estimation errors in cumulant-based optimum
beamforming. 

In order to understand the mismatch problem and find a way to alleviate its effects, we need
to analyze the problem. Consider the power response of a beamformer with a weight
vector $\mathbf{w}$, as a function of DOA $\theta$, defined as

$$ P(\theta) = | \mathbf{w}^H \mathbf{a}(\theta) |^2 \tag{18} $$

with $\mathbf{a}(\theta)$ denoting the steering vector for an arrival from $\theta$. The derivative, $\partial P(\theta) / \partial \theta$, can be
expressed as

$$ \frac{\partial P(\theta)}{\partial \theta} = 2\, \mathrm{Re}\left\{ \mathbf{w}^H \mathbf{a}(\theta) \left[ \mathbf{w}^H \frac{\partial \mathbf{a}(\theta)}{\partial \theta} \right]^* \right\} \tag{19} $$

Now consider the following scenario: we have an MVDR processor looking at $\theta_0$, which is the
expected DOA for the desired signal. Instead, the source illuminates the array from $\theta_d$, which is
very close but not equal to $\theta_0$. In this case, the beamformer treats the desired signal as interference
and nulls it; however, due to the distortionless response constraint for $\theta_0$, and since the angles are
very close, the derivative $\partial P(\theta) / \partial \theta$ must be large in magnitude for $\theta$ between $\theta_d$ and $\theta_0$. From
the derivative expression (19), it is clear that this is possible only if the norm of the weight vector
increases, since the inner product, $\mathbf{w}^H \mathbf{a}(\theta)$, and the derivatives, $\{ \partial a_k(\theta) / \partial \theta \}_{k=1}^{M}$, are bounded. In
this situation, the constraint is maintained by increasing the angle between the weight vector and
the look-direction steering vector. This phenomenon was exploited in [77] for tuning the beamformer
to acquire a weak desired signal in the presence of strong interference.

Note that the white-noise amplification factor for any processor with a weight vector $\mathbf{w}$ is $\mathbf{w}^H \mathbf{w}$;
hence, the nulling phenomenon can be prevented if the white noise level at the processor is sufficiently
high so that the output power minimization criterion limits the increase in the norm of $\mathbf{w}$. This can be
achieved by perturbing the covariance matrix estimate of array measurements by a scaled identity
matrix, as

$$ \mathbf{R}_\epsilon = \mathbf{R} + \epsilon\, \mathbf{I} \tag{20} $$

where $\epsilon$ is a non-negative parameter which adjusts the strength of perturbation. Alternatively, it
is possible to coin a term virtual SNR, SNR$_v$, defined as

SNR,  =  SNR  -  10  log,, ,  ,  (21)

Of.

We then determine the weight vector as

$$ \mathbf{w} = \mathbf{R}_\epsilon^{-1}\, \mathbf{a}(\theta_0) \tag{22} $$

A recent method presented in [If,] performs this procedure in an adaptive fashion by a simple
scaling of the weight vector. In our case, we do not have source DOA information, but we do have
an estimate of the steering vector. It is therefore possible to use this estimate in place of $\mathbf{a}(\theta_0)$ in
(22) to formulate the cumulant-based processor with limited signal nulling property.
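
The effect of the perturbation in (20) and (22) can be illustrated with a small mismatch scenario. The sketch below is not the authors' experiment: it uses an exact covariance matrix, a hypothetical 10-element half-wavelength linear array with a broadside look direction, a source 5 degrees off broadside at 20 dB elemental SNR, and an assumed loading level; the weights are additionally normalized to unit response in the look direction.

```python
import numpy as np

def loaded_weights(R, a_look, eps):
    """Weights proportional to (R + eps I)^{-1} a_look, scaled so that w^H a_look = 1."""
    x = np.linalg.solve(R + eps * np.eye(R.shape[0]), a_look)
    return x / np.vdot(a_look, x)

m = 10
k = np.arange(m)
a_look = np.ones(m, dtype=complex)                          # assumed DOA: broadside
a_true = np.exp(1j * np.pi * k * np.sin(np.deg2rad(5.0)))   # actual DOA: 5 degrees
R = 100.0 * np.outer(a_true, a_true.conj()) + np.eye(m)     # 20 dB elemental SNR

w_plain = loaded_weights(R, a_look, eps=0.0)
w_loaded = loaded_weights(R, a_look, eps=100.0)

resp_plain = abs(np.vdot(w_plain, a_true))     # response to the actual source
resp_loaded = abs(np.vdot(w_loaded, a_true))
norm_plain = np.linalg.norm(w_plain)           # square root of white-noise amplification
norm_loaded = np.linalg.norm(w_loaded)
```

In this setup the unperturbed processor nearly nulls the actual source, while the perturbed one retains a much larger response with a smaller weight norm.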


3.4  Multipath  Phenomena 

Eigendecomposition-based high-resolution methods [4,17,26-27,36,38,56,60-61,69,71] have proven
to be effective means of obtaining bearing estimates of far-field narrowband sources from noisy
measurements. The performance of these algorithms is severely degraded when coherence is present.
Several methods have been proposed to solve the coherent signals problem with restrictions on
array geometry [48-49,61-62,66,74,75]; however, with lack of knowledge of the array manifold it is not
possible to solve the coherence problem. MVDR beamforming also fails to perform optimally
when interference signals are correlated with the desired signal [54,78]. In some scenarios, even
the conventional beamformer outperforms the MVDR approach due to signal cancellation in the
MVDR beamformer.

In Section 3.2, we showed that the cumulant-based beamformer is not affected by the presence
of coherence among interfering Gaussian signals as long as they are not correlated with the desired
signal. The same is not possible for high-resolution DOA estimation methods; but, the MVDR
beamformer may perform equally well if the desired signal steering vector is known and a satisfactory
estimate of $\mathbf{R}$ is available. In this section, we show that the cumulant-based approach is not affected
by the presence of multipath propagation of the desired signal. In addition, we show that the
cumulant-based processor turns out to be the maximal-ratio-combiner [5] that maximizes the SINR.

With the presence of multipath propagation or smart jamming, our signal model in (1) changes
to

$$ r_k(t) = \sum_{l=1}^{L} a_k(\theta_{d_l})\, \eta_l\, d(t) + \sum_{j=1}^{J} a_k(\theta_{i_j})\, i_j(t) + n_k(t) \tag{23} $$

or in vector form

$$ \mathbf{r}(t) = [\, \mathbf{a}(\theta_{d_1}),\ \mathbf{a}(\theta_{d_2}),\ \ldots,\ \mathbf{a}(\theta_{d_L}) \,]\, [\, \eta_1,\ \eta_2,\ \ldots,\ \eta_L \,]^T\, d(t) + \mathbf{A}_I\, \mathbf{i}(t) + \mathbf{n}(t) \tag{24} $$

where the set of scalars $\{ \eta_1, \eta_2, \ldots, \eta_L \}$ constitute the multipath coefficients for an $L$-ray scenario.
The set of vectors $\{ \mathbf{a}(\theta_{d_1}), \mathbf{a}(\theta_{d_2}), \ldots, \mathbf{a}(\theta_{d_L}) \}$ are the corresponding steering vectors of the
$L$-ray model. Letting

$$ \mathbf{b} \triangleq [\, \mathbf{a}(\theta_{d_1}),\ \mathbf{a}(\theta_{d_2}),\ \ldots,\ \mathbf{a}(\theta_{d_L}) \,]\, [\, \eta_1,\ \eta_2,\ \ldots,\ \eta_L \,]^T = \sum_{l=1}^{L} \eta_l\, \mathbf{a}(\theta_{d_l}) \tag{25} $$

we can reduce the signal model for multipath phenomena to the single-ray propagation model of
Section 3.1.1,

$$ \mathbf{r}(t) = \mathbf{b}\, d(t) + \mathbf{A}_I\, \mathbf{i}(t) + \mathbf{n}(t) \tag{26} $$

because we can view the vector $\mathbf{b}$ as a generalized steering vector for a single desired signal although
it may not be a vector in the array manifold. Therefore, following our work in Section 3.2, the
cumulant-based blind estimation procedure will yield


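$$ \mathbf{c} = \beta_4\, \mathbf{b} \tag{27} $$

where $\beta_4 = |b_1|^2\, b_1^*\, \gamma_{4,d}$, in which $b_1$ is the first component of $\mathbf{b}$. Incorporating (27) into the
constrained power minimization procedure, we obtain the following weight vector,

$$ \mathbf{w}_{cum} = \beta_5\, \mathbf{R}^{-1} \mathbf{c} = \beta_4 \beta_5\, \mathbf{R}^{-1} \mathbf{b} \tag{28} $$

where $\beta_5 = ( \mathbf{c}^H \mathbf{R}^{-1} \mathbf{c} )^{-1}$.

Next, we find an alternate expression for $\mathbf{w}_{cum}$. Recall that the optimization problem which
results in $\mathbf{w}_{cum}$ is: minimize $\mathbf{w}^H \mathbf{R} \mathbf{w}$ subject to $\mathbf{w}^H \mathbf{c} = 1$, or by (27), $\mathbf{w}^H \mathbf{b} = 1/\beta_4$. We can express
the output power in the following way by using (9) and (26),

$$ \mathbf{w}^H \mathbf{R} \mathbf{w} = \sigma_d^2\, | \mathbf{w}^H \mathbf{b} |^2 + \mathbf{w}^H \mathbf{R}_u \mathbf{w} \tag{29} $$

but, due to the constraint $\mathbf{w}^H \mathbf{b} = 1/\beta_4$, the first term in the above expression is a constant.
Therefore, the original optimization problem can be translated into: minimize $\mathbf{w}^H \mathbf{R}_u \mathbf{w}$, subject
to $\mathbf{w}^H \mathbf{c} = 1$ or equivalently, $\mathbf{w}^H \mathbf{b} = 1/\beta_4$. The solution to this problem is

$$ \mathbf{w}_{cum} = \beta_6\, \mathbf{R}_u^{-1} \mathbf{c} \tag{30} $$

where $\beta_6 = ( \mathbf{c}^H \mathbf{R}_u^{-1} \mathbf{c} )^{-1}$. Of course, this solution can also be expressed in terms of $\mathbf{b}$, as

$$ \mathbf{w}_{cum} = \beta_7\, \mathbf{R}_u^{-1} \mathbf{b} \tag{31} $$

where $\beta_7 = \beta_4 \beta_6$.

Note that although (30) and (31) are alternate expressions for $\mathbf{w}_{cum}$, they are not the way to
actually compute $\mathbf{w}_{cum}$, since $\mathbf{R}_u$ is not available in general.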

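Next, we determine the weight vector that yields the maximum SINR. SINR can be expressed
as a function of the weight vector of the beamformer, as

$$ \mathrm{SINR}(\mathbf{w}) = \sigma_d^2\, \frac{ \mathbf{w}^H \mathbf{b}\, \mathbf{b}^H \mathbf{w} }{ \mathbf{w}^H \mathbf{R}_u \mathbf{w} } \tag{32} $$

Defining $\mathbf{v} \triangleq \mathbf{R}_u^{1/2} \mathbf{w}$, so that $\mathbf{w} = \mathbf{R}_u^{-1/2} \mathbf{v}$, we can reexpress (32) as

$$ \mathrm{SINR}(\mathbf{w}) = \mathrm{SINR}( \mathbf{R}_u^{-1/2} \mathbf{v} ) = \sigma_d^2\, \frac{ \mathbf{v}^H \mathbf{R}_u^{-1/2} \mathbf{b}\, \mathbf{b}^H \mathbf{R}_u^{-1/2} \mathbf{v} }{ \mathbf{v}^H \mathbf{v} } \tag{33} $$

Applying the Schwarz inequality [50] to (33), we find that

$$ \mathrm{SINR}(\mathbf{w}) = \mathrm{SINR}( \mathbf{R}_u^{-1/2} \mathbf{v} ) \le \sigma_d^2\, \| \mathbf{R}_u^{-1/2} \mathbf{b} \|^2 = \sigma_d^2\, \mathbf{b}^H \mathbf{R}_u^{-1} \mathbf{b} \tag{34} $$

where equality holds if and only if

$$ \mathbf{v} = \beta_8\, \mathbf{R}_u^{-1/2} \mathbf{b} \tag{35} $$

in which $\beta_8$ is a non-zero constant. Consequently, the optimum weight vector $\mathbf{w}_{sinr}$, which yields
the maximum SINR, can be determined from $\mathbf{w} = \mathbf{R}_u^{-1/2} \mathbf{v}$ and (35), as

$$ \mathbf{w}_{sinr} = \beta_8\, \mathbf{R}_u^{-1} \mathbf{b} \tag{36} $$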

Based on this derivation, some comments are in order. It is clear, by comparing (31) and (36),
that the cumulant-based beamformer does indeed yield the maximum possible SINR, since $\mathbf{w}_{cum}$
is just a scaled version of $\mathbf{w}_{sinr}$. This observation proves that the cumulant-based beamformer is
optimal. In addition, $\mathbf{w}_{cum}$ can be computed from the received data, whereas $\mathbf{w}_{sinr}$, as
implemented in (36), requires knowledge of $\mathbf{R}_u$, which cannot be determined from the received data in
the presence of the desired signal. Finally, note that the robust approaches presented in Section 3.3
are directly applicable in the presence of multipath.
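
The equivalence between (28) and (36) and the bound in (34) can be checked numerically. The sketch below draws a random generalized steering vector and a random Hermitian positive definite $\mathbf{R}_u$; all numerical values, the scale factor standing in for $\beta_4$, and the helper name `sinr` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 6
sigma_d2 = 2.0
b = rng.standard_normal(m) + 1j * rng.standard_normal(m)     # generalized steering vector
G = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
R_u = G @ G.conj().T + 0.5 * np.eye(m)                        # Hermitian positive definite
R = sigma_d2 * np.outer(b, b.conj()) + R_u                    # covariance of (26)

def sinr(w):
    """SINR of (32) for weight vector w."""
    return sigma_d2 * abs(np.vdot(w, b)) ** 2 / np.vdot(w, R_u @ w).real

c = (0.3 - 1.2j) * b                       # c = beta_4 b with an arbitrary scale
x = np.linalg.solve(R, c)
w_cum = x / np.vdot(c, x)                  # (28): uses R and c only
w_sinr = np.linalg.solve(R_u, b)           # (36) with beta_8 = 1: needs R_u
bound = sigma_d2 * np.vdot(b, np.linalg.solve(R_u, b)).real   # right-hand side of (34)

random_sinrs = [sinr(rng.standard_normal(m) + 1j * rng.standard_normal(m))
                for _ in range(200)]
```

Both `sinr(w_cum)` and `sinr(w_sinr)` attain `bound`, while randomly drawn weight vectors stay below it.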


3.5  Adaptive  Processing 

In real-world applications, adaptive beamforming is an important requirement, especially when the
desired signal source is in relative motion with respect to the array. In this section, we address this
problem by providing an "estimate and plug" type of adaptive algorithm for the CUM1 method.

The beamforming procedure (16) requires the inverse of the sample covariance matrix to
compute the weights. We can estimate the covariance matrix recursively, as

$$ \mathbf{R}_t = (1 - \alpha_1)\, \mathbf{R}_{t-1} + \alpha_1\, \mathbf{r}(t)\, \mathbf{r}^H(t) \tag{37} $$

Since we need to propagate the inverse of $\mathbf{R}_t$, we use the Sherman-Morrison formula [46] to obtain

$$ \mathbf{R}_t^{-1} = \frac{1}{1 - \alpha_1} \left[ \mathbf{R}_{t-1}^{-1} - \alpha_1\, \frac{ \mathbf{R}_{t-1}^{-1}\, \mathbf{r}(t)\, \mathbf{r}^H(t)\, \mathbf{R}_{t-1}^{-1} }{ 1 - \alpha_1 [\, 1 - \mathbf{r}^H(t)\, \mathbf{R}_{t-1}^{-1}\, \mathbf{r}(t) \,] } \right], \quad t = 1, 2, \ldots \tag{38} $$

with $\mathbf{R}_0^{-1} = \gamma\, \mathbf{I}$, where $\gamma$ is a large positive number and $\alpha_1$ controls the learning rate for second-order
statistics.
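
The update in (38) can be checked against direct inversion of (37). The sketch below is a minimal illustration with randomly generated data; the function name `update_inverse` and the chosen learning rate are assumptions, and the update relies on the previous inverse being Hermitian.

```python
import numpy as np

def update_inverse(Rinv_prev, r, alpha):
    """One step of (38): inverse of (1 - alpha) R_{t-1} + alpha r r^H from R_{t-1}^{-1}."""
    Rr = Rinv_prev @ r                                   # R_{t-1}^{-1} r(t)
    denom = 1.0 - alpha * (1.0 - np.vdot(r, Rr).real)    # r^H R^{-1} r is real here
    # For Hermitian R^{-1}, R^{-1} r r^H R^{-1} equals the outer product of Rr with itself.
    return (Rinv_prev - alpha * np.outer(Rr, Rr.conj()) / denom) / (1.0 - alpha)

rng = np.random.default_rng(5)
m, alpha = 5, 0.05
G = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
R_prev = G @ G.conj().T + np.eye(m)                      # Hermitian positive definite
r = rng.standard_normal(m) + 1j * rng.standard_normal(m)

Rinv_recursive = update_inverse(np.linalg.inv(R_prev), r, alpha)
Rinv_direct = np.linalg.inv((1.0 - alpha) * R_prev + alpha * np.outer(r, r.conj()))
```

The recursive form costs on the order of $M^2$ operations per snapshot, compared with the order of $M^3$ for a fresh inversion.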

To compute the weight vector, we also need the cumulant-based estimate of the source steering
vector $\mathbf{c}$. We can estimate it recursively as

$$ c_k(t) = (1 - \alpha_2)\, c_k(t-1) + \alpha_2 \left[ |r_1(t)|^2\, r_1^*(t)\, r_k(t) - 2\, p(t)\, q_k(t) - v^*(t)\, x_k(t) \right] \tag{39} $$

with the auxiliary processes defined as

$$ p(t) = (1 - \alpha_3)\, p(t-1) + \alpha_3\, |r_1(t)|^2 $$
$$ q_k(t) = (1 - \alpha_3)\, q_k(t-1) + \alpha_3\, r_1^*(t)\, r_k(t) $$
$$ v(t) = (1 - \alpha_3)\, v(t-1) + \alpha_3\, r_1^2(t) $$
$$ x_k(t) = (1 - \alpha_3)\, x_k(t-1) + \alpha_3\, r_1(t)\, r_k(t) $$

The auxiliary processes are required in order to implement the cross-correlation terms in (11). The
initial values for the auxiliary processes can be set to zero. Different learning rates are provided
to emphasize the fact that higher-order statistics require longer periods to acquire the required
information.

We can perform adaptive beamforming by computing the weight vector at each time as

$$ \mathbf{w}(t) = \mathbf{R}_t^{-1}\, \mathbf{c}(t) \tag{40} $$

and obtain the array output, as

$$ y(t) = \mathbf{w}^H(t)\, \mathbf{r}(t). \tag{41} $$
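
The recursions (37)-(41) can be combined into a single loop, sketched below. This is an illustration under assumptions, not the authors' implementation: the learning rates, the value of $\gamma$, the half-wavelength linear array, the stationary BPSK scenario, and the function name `adaptive_cum1` are all hypothetical, and the auxiliary processes are taken to carry the sensor index $k$ where the cross terms require it.

```python
import numpy as np

def adaptive_cum1(r, alpha1=0.01, alpha2=0.001, alpha3=0.001, gamma=1e3):
    """Estimate-and-plug sketch of (37)-(41) for snapshots r of shape (M, N)."""
    m, n = r.shape
    Rinv = gamma * np.eye(m, dtype=complex)      # R_0^{-1} = gamma I
    c = np.zeros(m, dtype=complex)
    q = np.zeros(m, dtype=complex)
    x = np.zeros(m, dtype=complex)
    p, v = 0.0, 0.0 + 0.0j
    w = np.zeros(m, dtype=complex)
    y = np.zeros(n, dtype=complex)
    for t in range(n):
        rt = r[:, t]
        r1 = rt[0]
        # (38): propagate the inverse covariance estimate
        Rr = Rinv @ rt
        denom = 1.0 - alpha1 * (1.0 - np.vdot(rt, Rr).real)
        Rinv = (Rinv - alpha1 * np.outer(Rr, Rr.conj()) / denom) / (1.0 - alpha1)
        # auxiliary second-order processes
        p = (1 - alpha3) * p + alpha3 * abs(r1) ** 2
        q = (1 - alpha3) * q + alpha3 * np.conj(r1) * rt
        v = (1 - alpha3) * v + alpha3 * r1 ** 2
        x = (1 - alpha3) * x + alpha3 * r1 * rt
        # (39): recursive cumulant-based steering vector estimate
        c = (1 - alpha2) * c + alpha2 * (abs(r1) ** 2 * np.conj(r1) * rt
                                         - 2 * p * q - np.conj(v) * x)
        w = Rinv @ c                              # (40)
        y[t] = np.vdot(w, rt)                     # (41): y(t) = w^H(t) r(t)
    return c, w, y

def steer(deg, m):
    return np.exp(1j * np.pi * np.arange(m) * np.sin(np.deg2rad(deg)))

def cgauss(rng, shape, sigma):
    return sigma * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

rng = np.random.default_rng(6)
m, n = 6, 10000
a_d = steer(5.0, m)
A_I = np.column_stack([steer(-30.0, m), steer(40.0, m)])
d = rng.choice([-1.0, 1.0], size=n)
r = np.outer(a_d, d) + A_I @ cgauss(rng, (2, n), 0.5) + cgauss(rng, (m, n), 0.1)

c_hat, w_hat, y = adaptive_cum1(r)
alignment = abs(np.vdot(c_hat, a_d)) / (np.linalg.norm(c_hat) * np.linalg.norm(a_d))
```

The smaller values chosen for `alpha2` and `alpha3` reflect the longer averaging that fourth-order statistics need.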

Adaptive  versions  of  CUM2  and  C2  methods  will  appear  in  a  later  publication. 

3.6  Simulations 

In this section we present various experiments to illustrate the performance of cumulant-based
beamforming. In all of the experiments we employed a uniformly spaced linear array, rather than an
arbitrary geometry. This is done for two reasons. First, covariance-based techniques are mainly designed
for this type of array structure, e.g., the spatial smoothing algorithm [48-49,61-62,66,74,75], so that
it will be possible to compare both previous and future work with our current results. Second,
by allowing a sufficient number of multipath rays, it is possible to represent any arbitrary steering
vector by the linear array, since the steering vectors of the uniformly spaced isotropic linear array
exhibit Vandermonde structure, resulting in linearly independent vectors for different DOA's. In
all batch type of experiments, the record length is 1000 snapshots and the array has 10 isotropic
elements with uniform half-wavelength spacing.

3.6.1  Experiment  1:  Desired  Signal  in  White-Noise 

In this experiment, we employ the linear array described above for optimum reception of a BPSK
signal, which is expected to arrive from broadside in the presence of temporally and spatially white,
equal power, circularly symmetric sensor noise; however, the desired source illuminates the array
from 5° off broadside.

Our first MVDR beamformer, MVDR1, looks to broadside, i.e., a mismatch condition. Our
second MVDR beamformer, MVDR2, uses exact knowledge of the DOA of the desired signal. We also
employ the cumulant-based beamformer of Section 3.2, CUM1, and the improved cumulant-based
beamformer CUM2 of Section 3.3.1. We investigate the performance of these processors for the
following two elemental SNR levels: 20 dB for a strong signal and 0 dB for a weak signal. Note
that the white-noise gain of any processor is limited to 10 dB by the number of sensors [15].

The beampattern responses (18) and white-noise gains of these beamformers are presented
in Fig. 9 for SNR = 20 dB. All responses are normalized to have a maximum value of 0 dB. For
comparison purposes, the optimum beamformer response, calculated by using true statistics in (16),
is presented as the dashed curves. Observe that due to the mismatch condition, MVDR1 nulls the
desired signal. More interestingly, the MVDR2 processor that utilizes the true DOA information
does not improve the SNR, due to the mismatch arising from the use of a sample-data covariance
matrix. The cumulant-based processors, CUM1 and CUM2, yield excellent performance without
any knowledge of source DOA. It is very important to observe that the performance of the
cumulant-based processors is better than that of the MVDR with exactly known look-direction.

We performed 100 Monte-Carlo runs to investigate the performance in a better way. The results
are given in Table 1.

Figure 11: Power of cumulant-based beamforming: (a) received signal at the reference
element at SNR = 0 dB, (b) output of CUM2 processor.


From these results, it is clear that cumulant-based processors are superior and the extra computation
involved in CUM2 reduces the variations. Note, also, that variations in the MVDR processors
are significantly larger than those of the cumulant-based counterparts. This agrees with the previous
remarks about the sensitivity of MVDR processing to experimental conditions in a high-SNR
environment.


Table 1: Results from 100 Monte-Carlo Runs for Experiment 1

                      White-Noise Gain (dB)
               SNR = 20 dB           SNR = 0 dB
Processor      Mean       Std.       Mean      Std.
MVDR1        -38.130     1.579      0.413     0.281
MVDR2          0.179     1.360      9.583     0.131
CUM1           9.954     0.015      9.058     0.359
CUM2           9.990     0.003      9.959     0.014

We performed the same experiment for the 0 dB SNR condition. Figure 10 illustrates the beampattern
responses and white-noise gains of the processors. Monte-Carlo results are also given in
Table 1. In this low-SNR condition, MVDR results are expected to improve since the mismatch
conditions for the desired signal will be masked by the presence of white noise of comparable power,
as explained in Section 3.3. The MVDR1 processor does not offer a significant gain due to the persistent
mismatch condition, but MVDR2 yields a near-optimum result, since the presence of higher-level noise
masks the mismatch due to the use of a sample-covariance matrix. The performance of the CUM1
processor is slightly below that of MVDR2 and exhibits more variations. This is due to the
inefficient use of the array data, since a high level of noise corrupts the cumulant estimates and
with CUM1 there are no precautions to combat these errors. As expected, CUM2 overcomes this
problem by using SVD. Results in Table 1 indicate that CUM2 achieves the best performance with
minimum variations.

Figure 12: Beamforming in the presence of spatially colored noise: (a) Spatial Power Spectral
Density of noise, (b) Beampattern of CUM2 processor. The optimum pattern is illustrated
in dashed lines for comparison purposes.

Finally, to demonstrate the power of cumulant-based beamforming, we illustrate the received
signal and the output of the CUM2 processor for the SNR = 0 dB case in Fig. 11. It is clear that CUM2 is
capable of sufficient noise rejection for performing correct decisions.

3.6.2  Experiment  2:  Spatially  Colored  Noise  and  Multipath  Propagation 

In  this  experiment,  we  investigate  the  performance  of  the  proposed  approach  in  the  presence  of 
spatially  colored  noise.  We  employ  the  linear  array  of  the  previous  experiment.  We  assume  that  the 
noise  field  is  created  by  a  set  of  point  sources  distributed  symmetrically  about  the  broadside  of  the 
linear  array.  As  suggested  in  [67],  this  source  structure  is  typical  when  the  noise  field  is  spherically 
or  cylindrically  isotropic.  In  this  case,  the  noise  covariance  matrix  is  symmetric-Toeplitz.  In  our 
experiment,  we  use  the  following  structure  for  the  covariance  matrix  of  undesired  components, 

$$R_u(i,j) = 0.8^{|i-j|} \tag{42}$$

The spatial power spectrum of undesired components is illustrated in Fig. 12a. It is clear
that most of the noise leaks into the system from broadside. The desired signal illuminates the
array from broadside, with an SNR of 10 dB. To illustrate the optimum combining property of our
approach, we implanted an exact replica of the desired signal illuminating the array from 60°, where
noise power is relatively less when compared to that from broadside. The beampattern of the CUM2
processor is given in Fig. 12b. For comparison purposes, we present the response of the optimum
beamformer based on exact statistical information, as a dashed curve. The maximum possible SNR
at the output is 23.689 dB for this scenario. It is clear that the response of CUM2 is almost identical
to that of the optimum beamformer: both processors emphasize the signal illuminating the array
from 60°, since the noise contribution is less in this region. We performed 100 Monte-Carlo runs
for this scenario, and the results are presented in Table 2. It is clear that both cumulant-based
processors perform equally well. The reason for this phenomenon is the presence of the multipath
from 60° through a low-noise background that virtually increases the effective SNR, which, in
turn, alleviates the effects of estimation errors. Note that the peak of the beampattern is slightly
shifted from 60°, in order to receive less interference. Similar behavior is observed in covariance-based
direction-of-arrival estimation in the presence of colored noise resulting in biased estimates
of parameters.

Table 2: Results from 100 Monte-Carlo Runs for Experiment 2

               SNR_o (dB)
Processor      Mean       Std.
CUM1          23.641     0.017
CUM2          23.645     0.015
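The covariance structure in (42) and the claim that most of the noise arrives from broadside can be checked with a short sketch. It builds the symmetric-Toeplitz matrix with entries 0.8^|i-j| and scans it with a conventional (Bartlett) spatial spectrum. The half-wavelength array spacing and the choice of the Bartlett scan are assumptions made for illustration; the method used to produce Fig. 12a is not stated in this section.

```python
import cmath
import math

M = 10       # number of sensors, as in the experiments
RHO = 0.8    # base of the correlation decay in (42)

# Symmetric-Toeplitz covariance of the undesired components: R(i, j) = 0.8**|i - j|
cov = [[RHO ** abs(i - j) for j in range(M)] for i in range(M)]

def steering(theta_deg):
    """Steering vector of a half-wavelength-spaced ULA (assumed geometry)."""
    psi = math.pi * math.sin(math.radians(theta_deg))
    return [cmath.exp(1j * psi * k) for k in range(M)]

def spatial_power(theta_deg):
    """Conventional (Bartlett) scan a^H R a / M^2 at angle theta."""
    a = steering(theta_deg)
    quad = sum(a[i].conjugate() * cov[i][j] * a[j]
               for i in range(M) for j in range(M))
    return quad.real / M ** 2

for angle in (0, 30, 60):
    print(f"{angle:3d} deg: {10 * math.log10(spatial_power(angle)):7.2f} dB")
```

With these assumptions the scanned noise power at broadside is more than ten times the power at 60°, which is consistent with the replica at 60° arriving through a quieter background.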

3.6.3  Experiment  3:  Effects  of  Robustness  Constraint 

In this experiment, we illustrate the effects of the robustness constraint of Section 3.3.3 on a CUM1
processor in the presence of white noise. We employ the same array as in the previous experiments.
We employ CUM1, since this processor uses the data inefficiently and requires a robust approach. In
our experiment, we consider the situation with SNR = 0 dB. Figure 13 illustrates the beampatterns
of the CUM1 processor for several virtual-SNR values. It is clear from the results that, as the perturbation
increases, the patterns match better, since the mismatch due to estimation errors in the steering
vector estimate is masked by the presence of a virtually increased level of noise. This method should
be used sparingly in the presence of jammers, because virtually increasing the noise level results in
diverting the capability of the array from nulling the directional interference.
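The masking effect of a virtually increased noise level can be illustrated with diagonal loading. This is an interpretation, not the formulation of Section 3.3.3, which is not reproduced here: the sketch assumes that the robustness constraint acts like adding a virtual white-noise term `load * I` to the covariance matrix. For visibility it uses a 20 dB source at 5° and a broadside look direction (a deliberate steering mismatch) on a half-wavelength-spaced 10-element array, and it inverts the loaded covariance in closed form with the Sherman-Morrison identity. All names and parameter values are illustrative.

```python
import cmath
import math

M = 10
SNR = 10 ** (20 / 10)   # 20 dB source power; chosen to make the mismatch visible

def steering(theta_deg):
    """Steering vector of a half-wavelength-spaced ULA (assumed geometry)."""
    psi = math.pi * math.sin(math.radians(theta_deg))
    return [cmath.exp(1j * psi * k) for k in range(M)]

def inner(u, v):
    """Hermitian inner product u^H v."""
    return sum(ui.conjugate() * vi for ui, vi in zip(u, v))

def loaded_mvdr_gain_db(load, look_deg=0.0, source_deg=5.0):
    """White-noise gain (dB) of MVDR built from R + load*I, with R = SNR*a*a^H + I."""
    a = steering(source_deg)
    d = steering(look_deg)
    lam = 1.0 + load                          # white-noise level after loading
    # Sherman-Morrison: (lam*I + SNR*a*a^H)^{-1} d, using a^H a = M
    coef = SNR * inner(a, d) / (lam + SNR * M)
    v = [(d[k] - coef * a[k]) / lam for k in range(M)]
    # The MVDR normalisation cancels in the gain ratio, so v can be used directly.
    gain = abs(inner(v, a)) ** 2 / inner(v, v).real
    return 10 * math.log10(gain)

loads = (0.0, 10.0, 100.0, 1000.0)
gains = [loaded_mvdr_gain_db(load) for load in loads]
for load, gain in zip(loads, gains):
    print(f"virtual noise {load:7.1f}: white-noise gain {gain:7.2f} dB")
```

In this toy setting the gain toward the true source rises steadily as the virtual noise grows, because the beamformer spends less effort cancelling the mismatched signal. The same loading also weakens the ability to place deep nulls, which matches the caution about jammers in the text.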

3.6.4  Experiment  4:  Multiple  Interferers 

In this experiment, we consider the problem of beamforming in a multipath environment in the
presence of multiple jammers. We employ the same array as in the previous experiments. The
signal of interest originates from a BPSK communication source, and it is expected from broadside;
however, due to multipath effects, multiple delayed and shifted replicas are received. There are two
jammers, and one is subject to multipath as well. Table 3 summarizes the signal structure.

Note that there are 10 wavefronts illuminating the array and it is not possible to estimate their
DOAs with any existing high-resolution method; hence, signal-copy algorithms [58] cannot be
used even with perfect knowledge of the array manifold.

Figure 13: Beampattern of CUM1 processor for varying virtual SNR: (a) 0 dB, (b) -6 dB,
(c) -10 dB, (d) -20 dB. The optimum pattern is illustrated in dashed lines for comparison
purposes.


Due to the presence of coherent wavefronts, second-order statistics are not spatially stationary along
the array; hence, it is not meaningful to define SINR at an array element. Instead, we compute the
SINR at the output of the optimal processor by employing true statistics. The maximum possible
SINR_o is found from (34) to be 12.677 dB. From Table 4, we observe that CUM2 performs very
well under these severe conditions. Performance of CUM1 is affected by strong interferers since this
processor does not utilize all of the available information. Finally, we observe that MVDR with
correct look direction cancels the desired signal due to coherence. Note that CUM2 exhibits less
variation than the other processors.

To gain more insight into the operation of the processors, we illustrate the beampatterns for
MVDR and CUM2 in Fig. 14. We focus on the region where the wavefronts are received by the array.
It is observed that the MVDR processor does not null the jammer from -1°, since it maintains the
look-direction constraint for 0° and tries to minimize the output power by destructively combining
the coherent wavefronts. On the other hand, CUM2 is blind to Gaussian interferers, and, as in
Experiment 2, it estimates the generalized steering vector of the desired signal and combines the
wavefronts to enhance SINR at the output. CUM2 puts a null on the jammer from -1°, destructively
combines the wavefronts from the first jammer by weight-phasing rather than null-steering, and
reinforces the wavefronts from the desired source.

Table 3: Signal structure for Experiment 4 (only fragments of the table are legible)

Source           : BPSK, JAMMER1, JAMMER2, NOISE
Multipath Coeff. : (0.0, -0.5), -
(unlabeled)      : 10°, 6°

Figure 14: Beampatterns and array gains of processors: (a) MVDR with correct look direction,
(b) CUM2. The optimum pattern is illustrated in dashed lines for comparison purposes.

Table 4: Results from 100 Monte-Carlo Runs for Experiment 4

               SINR_o (dB)
Processor      Mean       Std.
MVDR         -28.424     [illegible]
CUM1           4.110     2.118
CUM2          10.290     0.746
C2            11.879     0.627

Finally, we implement the C2 beamformer suggested in Section 3.3.2: we first estimate the
steering vector as done for CUM1, but then further project it into the subspace spanned by the
principal eigenvectors of the sample covariance matrix. We use the resultant vector as the estimate
of the desired signal steering vector, and construct an MVDR beamformer based on it. The
performance of the resultant processor is given in Table 4.

We  observe  that  by  combining  cumulants  with  covariance  information,  we  obtain  the  best 
results. 

3.6.5  Experiment  5:  Adaptive  Processing 

In this section, we demonstrate the results from the adaptive version of the CUM1 approach as described
in Section 3.5. We employ the 10-element uniform linear array of previous experiments. The initial
pattern of the beamformer is designed to be isotropic, by letting $c(0) = [1, 0, \ldots, 0]^T$. The desired signal
illuminates the array from broadside with SNR = 10 dB. A jammer with power equal to that of the
desired source is present at 30°. Note that there is no nonstationarity involved in this experiment;
our aim is to demonstrate the evolution of the beamforming process and indicate the data lengths
required for cumulant and covariance estimation. Tracking properties will be included in our future
work, including comparisons with adaptive versions of CUM2 and C2 processors.

Figure 15 illustrates the beampattern of the adaptive CUM1 processor as time evolves. After
100 snapshots, the beampattern is still close to isotropic. At 300 snapshots, the covariance matrix
estimate is improved, indicating the presence of the desired signal from broadside. At this time point,
the cumulant-based steering vector estimate has not matured, so it cannot prevent the desired
signal from being cancelled. After 500 snapshots, cumulant estimates get better, and there is a
tendency to cancel the interference rather than the desired signal. Finally, after 700 snapshots the
processor removes the interference by null steering.

3.6.6  Experiment  6:  Effects  of  Data  Length 

In this section, we employ the linear array of Experiment 1, with the same noise conditions, and
vary the data length to observe the behavior of the beamformers CUM1, CUM2, MVDR1 and
MVDR2. Figure 16 demonstrates the variation of white-noise gain of the processors with data
length, for 0 dB and 20 dB SNR levels. Each point on the plots is obtained by averaging the results
from 50 Monte-Carlo simulations.

From Fig. 16a it is clear that CUM2 outperforms all the processors, including MVDR2, which
utilizes the correct look direction, for all data lengths. Furthermore, the small-sample properties of
CUM2 are quite impressive, motivating further research for developing its adaptive version. Low
SNR masks the mismatch in MVDR2 due to the use of a sample covariance matrix; hence, as can be
seen from Fig. 16a, CUM1 is inferior to MVDR2.

Figures 16b and 16c indicate the effect of higher SNR on performance. CUM1 and CUM2
perform almost identically for all data lengths. Their gain is larger than 9 dB even for fewer than 50
snapshots. MVDR2 cannot recover in this experiment since the mismatch results in severe signal
cancellation. We do not include the response of MVDR1, because its performance drifts around
-55 dB.

These results indicate that our approach has very promising small-sample behavior that deserves
more research. This will be a topic of another paper.


3.7  Conclusions 

We  have  presented  optimum  beamforming  algorithms  for  non-Gaussian  signals,  which  are  based 
on  fourth-order  cumulants  of  the  data  received  by  the  array.  Our  proposed  methods  do  not  make 
any  assumption  about  the  sensor  locations  and  characteristics,  i.e.,  they  are  blind  beamforming 
methods.  Cumulant-based  estimation  is  employed  to  identify  the  steering  vector  of  the  signal 
of interest and MVDR beamforming using this estimate is used to remove Gaussian interference
components. We have suggested several approaches to combat effects of estimation errors. We have
also  implemented  a  recursive  version  of  the  method  to  enable  real-time  beamforming.  Simulation 
experiments  demonstrate  the  performance  of  our  approaches  in  a  wide  variety  of  situations.  It  is 
important  to  emphasize  that  the  proposed  methods  outperform  an  MVDR  beamformer  with  an 
exactly  known  look-direction. 

In our future work, we shall address the problem of optimum beamforming in the presence
of  multiple  non-Gaussian  interferers  and  design  of  adaptive  algorithms  with  better  convergence 
properties. 


4  Final  Comments 

In this paper, we summarized our recent research results on the applications of cumulants in speech
and  array  processing.  The  results  are  very  promising,  and  encourage  further  study  in  these  areas. 

We acknowledge that, especially in speech processing, cumulant applications are still in a very
premature state. Array processing, however, captured more attention, particularly after the excellent
work in [!>]. On the other hand, array processing has many practical problems, such as unknown
sensor gain/phase factors, array shape calibration, and DOA estimation for coherent sources in colored
noise. It is our aim to develop cumulant-based solutions to those practical problems that still
lack reasonable solutions when only second-order statistics are employed.

Abstract: The problem of generalized nonparametric function estimation has
received  considerable  attention  over  the  last  two  decades.  Most  of  the  approaches  have 
assumed  smoothness  of  the  function  to  be  estimated  generally  in  the  form  of  continuity 
of  higher  order  derivatives  and/or  bounded  variation  and  have  used  convolution  kernels 
or  splines  as  the  estimation  devices.  Generally  focus  has  been  on  density  estimation  or 
nonparametric  regression.  The  spline  and  kernel-based  methods  may  be  inappropriate  if 
either  smoothness  assumptions  are  violated  or  if  additional  side  conditions  are  present. 
Wegman  (1984)  introduced  a  general  framework  for  optimal  nonparametric  function 
estimation  which  applies  to  a  much  wider  class  of  problems  than  simply  density 
estimation  or  nonparametric  regression.  In  this  framework,  a  class  of  admissible 
estimators  is  regarded  as  a  compact,  convex  subset  of  a  Banach  function  space  and  a 
convex  objective  functional  is  to  be  optimized  over  this  set.  Recent  work  on  wavelets 
suggests a powerful method for constructing orthonormal bases to span the set of
admissible  estimators.  Moreover,  older  work  on  frames  has  re-emerged  to  some  level  of 
prominence  because  of  the  work  on  wavelets.  The  optimal  estimates  can  be  computed 
as  weighted  linear  combinations  of  the  orthonormal  bases.  The  weight  coefficients  are 
computed  as  moments  of  the  basis  functions.  We  illustrate  these  methods  with  some 
numerical examples.

1. Introduction.

The  method  of  moments  is  a  time-honored  traditional  technique  in  statistical 
inference  while  wavelet  analysis  has  recently  burst  upon  the  mathematical  scene  to 
capture  the  enthusiasm  and  imagination  of  many  applied  mathematicians  and  engineers 
both  because  of  their  important  applications  in  signal  and  image  processing  and  other 
engineering  applications  and  also  because  of  the  inherent  elegance  of  the  techniques.  In 
this  paper  we  bring  these  tools  together  to  illustrate  their  application  to  transient  signal 
processing. Wavelets are described in detail in a number of locations. Much of the
fundamental  work  was  done  by  Daubechies  and  is  reported  in  Daubechies,  Grossmann 
and  Meyer  (1986)  and  Daubechies  (1988).  Heil  and  Walnut  (1989)  provide  a  survey 
from  a  mathematical  perspective  while  Rioul  and  Vetterli  (1991)  provide  a  survey  from 
a  more  engineering  perspective.  The  new  book  by  Chui  (1992)  is  an  excellent  integrated 
treatment  which  I  believe  is  more  mathematically  sophisticated  than  the  author 
supposes.  In  spite  of  its  title  as  an  introduction,  it  requires  somewhat  more 
mathematical  depth  and  maturity  and  is  best  regarded  as  more  of  a  monograph. 

This  present  paper  describes  the  basic  wavelet  theory  in  the  context  of  the 
general statistical problem of nonparametric function estimation. It will be shown that
traditional  moment  based  techniques  have  an  interesting  and  useful  connection  to 
modern  nonparametric  functional  inference  for  signal  processing  via  wavelets.  Wegman 
(1984)  describes  a  basic  framework  for  optimal  nonparametric  function  estimation.  This 
framework  captures  the  optimal  estimation  of  a  wide  variety  of  practical  function 
estimation  problems  in  a  common  theoretical  construct.  Wegman  (1984),  however,  only 
discusses  the  existence  of  such  optimal  estimators.  In  the  present  paper,  we  are 
interested  in  combining  this  optimality  framework  with  more  general  wavelet 
algorithms  as  computational  devices  for  general  optimal  nonparametric  function 
estimation.  A  new  application  of  optimal  nonparametric  function  estimation  is  found  in 
Le  and  Wegman  (1991).  A  second  application  will  be  discussed  in  this  paper. 

In  section  2,  we  discuss  the  optimal  nonparametric  function  estimation 
framework.  In  section  3,  we  turn  to  a  discussion  of  the  general  function  analytic 
framework  which  leads  to  bases  and  frames.  Section  4  introduces  the  notion  of  a 
wavelet basis and demonstrates the connection with Fourier series and Parseval’s
Theorem. In section 5 we turn to transient signal estimation, develop an optimization
criterion  and  illustrate  the  computation  of  a  transient  signal  estimator. 

2.  Optimal  Nonparametric  Function  Estimation. 

Consider a general function, f(x), to be estimated based on some sampled data,
say $x_1, \ldots, x_n$. This is, in fact, the most elementary estimation problem in statistical
inference. Often the function, f, in question is the probability distribution function or
the probability density function and most frequently the approach taken is to place the
function within a parametric family indexed by some parameter, say $\theta$. Rather than
estimate f directly, the parameter $\theta$ is estimated, with $f_\theta$ then being estimated by $\hat{f}_\theta = f_{\hat{\theta}}$.
Under a variety of circumstances, it is much more desirable to take a nonparametric
approach so as to avoid problems associated with misspecification of the parametric family.
This is particularly the case when data is relatively plentiful and the information
captured by the parametric model is not needed for statistical efficiency.

Probability  density  estimation  and  nonparametric,  nonlinear  regression  are 
probably  the  two  most  widely  studied  nonparametric  function  estimation  problems. 
However,  other  problems  of  interest  which  immediately  come  to  mind  are  spectral 
density  estimation,  transfer  function  estimation,  impulse  response  function  estimation, 
all  in  the  time  series  setting,  and  failure  rate  function  estimation  and  survival  function 
estimation  in  the  reliability/biometry  setting.  While  it  may  be  the  case  that  we  simply 
may  want  an  unconstrained  estimate  of  the  function,  it  is  more  often  the  case  that  we 
wish  to  impose  one  or  more  constraints,  for  example,  positivity,  smoothness,  isotonicity, 
convexity,  transience  and  fixed  discontinuities  to  name  a  few  appropriate  constraints. 
By  far,  the  most  common  assumption  is  smoothness  and  frequently  the  estimation  is  via 
a  kernel  or  convolution  smoother.  We  would  like  to  formulate  an  optimal 
nonparametric  framework. 

We formulate the optimization problem as follows. Let $\mathcal{H}$ be a Hilbert space of
functions over $\mathbb{R}$, the real numbers (or $\mathbb{C}$, the complex numbers). For purposes of the
present paper, we assume $\mathbb{R}$ rather than $\mathbb{C}$ unless otherwise specified. The techniques we
outline here are not limited to a discussion of $L_2(\mathbb{R})$ although quite often we do take $\mathcal{H}$
to be $L_2$. In this case, we take

$$\langle f, g \rangle = \int f(x)\, g(x)\, d\mu(x),$$

where $\mu$ is Lebesgue measure. We emphasize that this is not absolutely required. As
usual $\| f \| = \sqrt{\langle f, f \rangle}$. A functional $\mathcal{L}$ is linear if

$$\mathcal{L}(\alpha f + \beta g) = \alpha \mathcal{L}(f) + \beta \mathcal{L}(g), \quad \text{for every } f, g \in \mathcal{H} \text{ and } \alpha, \beta \in \mathbb{R}.$$

$\mathcal{L}$ is convex on $S \subset \mathcal{H}$ if

$$\mathcal{L}(tf + (1-t)g) \le t\mathcal{L}(f) + (1-t)\mathcal{L}(g), \quad \text{for every } f, g \in S \text{ with } 0 < t < 1.$$

$\mathcal{L}$ is concave if the inequality is reversed. $\mathcal{L}$ is strictly convex (concave) on S if the
inequality is strict. $\mathcal{L}$ is uniformly convex on S if

$$t\mathcal{L}(f) + (1-t)\mathcal{L}(g) - \mathcal{L}(tf + (1-t)g) \ge c\, t(1-t)\, \| f - g \|^2$$

for every $f, g \in S$ and $0 < t < 1$.
We wish to use $\mathcal{L}$ as the general objective functional in our optimization
framework. For example, if we are concerned with likelihood, we may consider the log
likelihood,

$$\mathcal{L}(f) = \sum_{i=1}^{n} \log f(x_i), \quad x_i \text{ are a random sample from } f.$$

If we have censored samples we may wish to consider

$$\mathcal{L}(g) = \sum_{i=1}^{n} \delta_i \log g(x_i) + \sum_{i=1}^{n} (1 - \delta_i) \log \bar{G}(x_i),$$

$x_i$ again a random sample, $\delta_i$ a censoring random variable, $\bar{G} = 1 - G$, and
$G(x) = \int_{-\infty}^{x} g(u)\, du$. This is the censored log likelihood. Another example is the
penalized least squares. In this case


du. 


Here  L  is  a  differential  operator  and  the  solution  of  this  optimization  problem  over 
appropriate  spaces  is  called  a  penalized  smoothing  L-spline.  If  L  =  D2  then  the  solution 
is  the  familiar  cubic  spline. 
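The log likelihood is a concave functional of f, because the logarithm is concave and a sum of concave terms is concave. The following sketch checks the defining inequality numerically by comparing the log likelihood of a mixture tf + (1 - t)g with the same weighted combination of the individual log likelihoods. The two normal densities, the sample size, the seed, and the helper names are illustrative choices, not taken from the paper.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(density, sample):
    """L(f) = sum_i log f(x_i)."""
    return sum(math.log(density(x)) for x in sample)

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(50)]

def f(x):
    return normal_pdf(x, 0.0, 1.0)

def g(x):
    return normal_pdf(x, 1.0, 2.0)

def concavity_gap(t):
    """L(tf + (1-t)g) - [t L(f) + (1-t) L(g)]; nonnegative for a concave functional."""
    def mixture(x):
        return t * f(x) + (1.0 - t) * g(x)
    lhs = log_likelihood(mixture, sample)
    rhs = t * log_likelihood(f, sample) + (1.0 - t) * log_likelihood(g, sample)
    return lhs - rhs

for t in (0.25, 0.5, 0.75):
    print(f"t = {t:.2f}: gap = {concavity_gap(t):.4f}")
```

A nonnegative gap for every t is the concave counterpart of the convexity inequality defined above, which is the property that existence results for maximization problems rely on.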

The basic idea is to construct $S \subset \mathcal{H}$ where S is the collection of functions, g,
which satisfy our desired constraints such as smoothness or isotonicity. We wish to
optimize $\mathcal{L}(g)$ over S. The optimized estimator will be an element of S and hence will
inherit whatever properties we choose for S. The estimator will optimize $\mathcal{L}(g)$ and
hence will be chosen according to whatever optimization criterion appeals to the
investigator. In this sense we can construct designer estimators, i.e. estimators that are
designed by the investigator to suit the specifics of the problem at hand.

Of  course,  in  a  wide  variety  of  rather  disparate  contexts,  many  of  these 
estimators  are  already  known.  However,  they  may  be  proven  to  exist  in  a  general 
framework  according  to  the  following  theorem. 

Theorem  2.1: 

Consider  the  following  optimization  problem: 

Minimize (maximize) $\mathcal{L}(f)$ subject to $f \in S \subset \mathcal{H}$.

Then

a. If $\mathcal{H}$ is finite dimensional, $\mathcal{L}$ is continuous and convex (concave) and S is closed
and bounded, then there exists at least one solution.

b. If $\mathcal{H}$ is infinite dimensional, $\mathcal{L}$ is continuous and convex (concave) and S is
closed, bounded and convex, then there exists at least one solution.

c. If $\mathcal{L}$ in a. or b. is strictly convex (concave), the solution is unique.

d. If $\mathcal{H}$ is infinite dimensional, $\mathcal{L}$ is continuous and uniformly convex (concave)
and S is closed and convex, then there exists a unique solution.

Proof: A full proof is given in Wegman (1984). For completeness, we outline the basic
elements here. a. For the finite dimensional case, S closed and bounded implies that S
is compact. Choose $f_n \in S$ such that $\mathcal{L}(f_n)$ converges to $\inf\{\mathcal{L}(f) : f \in S\}$. Because of
compactness, there is a convergent subsequence $f_{n_k}$ having a limit, say $f_*$. By
continuity of $\mathcal{L}$,

$$\mathcal{L}(f_*) = \lim_{k \to \infty} \mathcal{L}(f_{n_k}) = \inf\{\mathcal{L}(f) : f \in S\}.$$

$f_*$ is the required optimizer. For part b., we have the same basic idea except that S
closed, bounded and convex implies that S is weakly compact. We use the weak
continuity of $\mathcal{L}$. Uniqueness follows by supposing that both $f_*$ and $f_{**}$ are minimizers.
Then

$$\mathcal{L}(t f_* + (1-t) f_{**}) < t\mathcal{L}(f_*) + (1-t)\mathcal{L}(f_{**}) = \inf\{\mathcal{L}(f) : f \in S\}.$$

This implies that neither $f_*$ nor $f_{**}$ is a minimizer, which is a contradiction. □

This theorem gives us a unified framework for the construction of optimal
nonparametric  function  estimators.  It  does  not,  however,  give  us  a  definitive  method 
for  construction  of  nonparametric  function  estimators.  We  give  a  constructive 
framework  in  the  next  several  sections.  In  closing  this  section  we  refer  the  reader  to 
Wegman  (1984)  for  the  complete  proof  of  Theorem  2.1  and  many  more  examples  of  the 
use  of  this  result. 

3.  Bases  and  Subspaces. 

In this section, we discuss the basic theory of spanning bases and their
application to function estimation. Consider $f, g \in \mathcal{H}$. f is said to be orthogonal to g,
written $f \perp g$, if $\langle f, g \rangle = 0$. An element f is normal if $\| f \| = 1$. A family of elements,
say $\{e_\lambda : \lambda \in \Lambda\}$, is orthonormal if each element is normal and if for any pair $e_1, e_2$ in the
family, $e_1 \perp e_2$. A family $\{e_\lambda : \lambda \in \Lambda\}$ is complete in $S \subset \mathcal{H}$ if the only element in S which
is orthogonal to every $e_\lambda$, $\lambda \in \Lambda$, is 0. A basis or base of S is a complete orthonormal
family in S. A Hilbert space has a countable basis if and only if it is separable, i.e. if
and only if it has a countable dense subset. Ordinary $L_p$ spaces are separable. We are
now in a position to state the basic result characterizing bases of Hilbert spaces or
subspaces. We write $\mathrm{span}(\{e_\lambda\})$ to be the minimal subspace containing $\{e_\lambda\}$. This is
the space generated by the elements $\{e_\lambda\}$.


Theorem  3.1: 

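Let $\mathcal{H}$ be a separable Hilbert space. If $\{e_k\}_{k=1}^{\infty}$ is an orthonormal family in $\mathcal{H}$,
then the following are equivalent.

a. $\{e_k\}_{k=1}^{\infty}$ is a basis for $\mathcal{H}$.

b. If $f \in \mathcal{H}$ and $f \perp e_k$ for every k, then f = 0.

c. If $f \in \mathcal{H}$, then $f = \sum_{k=1}^{\infty} \langle f, e_k \rangle e_k$. (orthogonal series expansion)

d. If $f, g \in \mathcal{H}$, then $\langle f, g \rangle = \sum_{k=1}^{\infty} \langle f, e_k \rangle \langle g, e_k \rangle$.

e. If $f \in \mathcal{H}$, $\| f \|^2 = \sum_{k=1}^{\infty} | \langle f, e_k \rangle |^2$. (Parseval’s Theorem)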

Proof:

a ⇒ b: Trivial by definition.

b ⇒ c: We claim $\mathcal{H} = \mathrm{span}(\{e_k\})$. If not, there is $f \ne 0$, $f \in \mathcal{H}$ such that
$f \notin \mathrm{span}(\{e_k\})$. This implies that $f \perp e_k$ for every k. But $f \perp e_k$ for every k and $f \ne 0$ is a
contradiction to the $\{e_k\}$ being a basis. Let $\mathcal{H}_k = \mathrm{span}(e_k)$. Then $\mathcal{H} = \mathrm{span}(\{e_k\}) = \sum_k \oplus\, \mathcal{H}_k$.
This implies that for $f \in \mathcal{H}$,

$$f = \sum_{k=1}^{\infty} c_k e_k. \tag{3.1}$$

Substituting (3.1) in the expression for the inner product yields

$$\langle f, e_j \rangle = \Big\langle \sum_k c_k e_k,\ e_j \Big\rangle = \sum_k c_k \langle e_k, e_j \rangle.$$

By the orthonormal property, $\langle e_k, e_j \rangle = 1$ if k = j and = 0 otherwise. It follows that
$\langle f, e_j \rangle = c_j$. Thus

$$f = \sum_{k=1}^{\infty} \langle f, e_k \rangle e_k. \tag{3.2}$$

c ⇒ d: $\langle f, g \rangle = \big\langle f, \sum_k \langle g, e_k \rangle e_k \big\rangle = \sum_k \langle g, e_k \rangle \langle f, e_k \rangle$.

d ⇒ e: Let f = g in part d.

e ⇒ a: If $f \in \mathcal{H}$ and $f \perp e_k$ for every k, then $\langle f, e_k \rangle = 0$ for every k. This in
turn implies that $\| f \| = 0$. Thus f = 0. This finally implies $\{e_k\}$ is a basis. □

Thus given any basis $\{e_k\}_{k=1}^{\infty}$, we can exactly write $f = \sum_k c_k e_k$, and we can
estimate f by $\sum_k \hat{c}_k e_k$. Thus a computational algorithm for the optimal nonparametric
function estimator can be based on this result from Theorem 3.1.c. However, this does
not yet take into account the “design” set, S. In order to more carefully study the
structure of S we consider the following result. In the following discussion let $S \subseteq \mathcal{H}$.
Then define $S^{\perp} = \{f \in \mathcal{H} : f \perp S\}$.
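The orthogonal series expansion and Parseval's Theorem from Theorem 3.1 can be checked numerically in a finite-dimensional setting. The sketch below uses the discrete Haar basis of R^8 as a stand-in for a basis of a separable Hilbert space. The Haar basis, the test vector, and all function names are illustrative assumptions rather than the construction used later in the paper.

```python
import math

def haar_basis(n):
    """Orthonormal discrete Haar basis of R^n (n a power of two), as a list of vectors."""
    basis = [[1.0 / math.sqrt(n)] * n]          # constant (scaling) vector
    length = n
    while length >= 2:
        half = length // 2
        for start in range(0, n, length):
            e = [0.0] * n
            for i in range(start, start + half):
                e[i] = 1.0 / math.sqrt(length)   # positive half of the wavelet
            for i in range(start + half, start + length):
                e[i] = -1.0 / math.sqrt(length)  # negative half of the wavelet
            basis.append(e)
        length = half
    return basis

def inner(u, v):
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))

f = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
basis = haar_basis(len(f))

# Coefficients c_k = <f, e_k>, as in (3.2)
coeffs = [inner(f, e) for e in basis]

# Reconstruction f = sum_k c_k e_k
recon = [sum(c * e[i] for c, e in zip(coeffs, basis)) for i in range(len(f))]

# Parseval: ||f||^2 = sum_k |c_k|^2
energy_f = inner(f, f)
energy_c = sum(c * c for c in coeffs)
print(recon)
print(energy_f, energy_c)
```

Replacing the exact coefficients by estimates, or keeping only some of them, gives the kind of series estimator described in the paragraph above.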

Theorem 3.2:

If $S \subseteq \mathcal{H}$ is a subset of $\mathcal{H}$, then

a. $S^{\perp}$ is a subspace of $\mathcal{H}$ and $S \cap S^{\perp} \subseteq \{0\}$

b. $S \subseteq S^{\perp\perp} = \mathrm{span}(S)$

c. $S$ is a subspace if and only if $S = S^{\perp\perp}$.

Proof: $S^{\perp}$ is a linear manifold. To see this, if $f_1, f_2 \in S^{\perp}$, then for every $g \in S$, $\langle a_1 f_1 + a_2 f_2, g \rangle = a_1 \langle f_1, g \rangle + a_2 \langle f_2, g \rangle = a_1 \cdot 0 + a_2 \cdot 0 = 0$. Thus $a_1 f_1 + a_2 f_2 \in S^{\perp}$. This implies $S^{\perp}$ is a linear manifold, which is sufficient to show that $S^{\perp}$ is a subspace provided we can show $S^{\perp}$ is closed. To see this, if $f \in \mathrm{closure}(S^{\perp})$, then there exists $\{f_n\} \subset S^{\perp}$ such that $f = \lim f_n$ and, for every $g \in S$, $\langle f_n, g \rangle = 0$. But $\langle f, g \rangle = \lim \langle f_n, g \rangle = \lim 0 = 0$. This implies $f \perp S$, which in turn implies $f \in S^{\perp}$. Part b follows from part a by replacing $S$ by $S^{\perp}$. Part c is a straightforward application of the two previous parts. □

Suppose now that we have a basis for $\mathcal{H}$, call it $\{e_k\}_{k=1}^{\infty}$. This basis obviously also spans the subset $S$ of $\mathcal{H}$, and hence any of our “designer” functions in $S$ can be written in terms of the basis $\{e_k\}_{k=1}^{\infty}$. The unnecessary basis elements will simply have coefficients of 0. In a sense, however, this basis is too rich, and in a noisy estimation setting superfluous basis elements will only contribute to estimating noise. As part of our “designer” set, $S$, philosophy, we would like to have a minimal basis set for $S$. Theorem 3.2 gives us a test for this condition. Consider a basis $\{e_k\}_{k=1}^{\infty}$ for $\mathcal{H}$. Form $B_S$, which is to be a basis for $S$. We define $B_S$ by the following routine. If there is a $g \in S$ such that $\langle g, e_k \rangle \neq 0$, then let $e_k \in B_S$. If on the other hand there is a $g \in S^{\perp}$ such that $\langle g, e_k \rangle \neq 0$, then let $e_k \in B_{S^{\perp}}$. Unfortunately, it may not be that $B_S \cap B_{S^{\perp}} = \emptyset$. But this algorithm yields $\{e_k\} = B_S \cup B_{S^{\perp}}$. Moreover $S \subseteq \mathrm{span}(B_S)$. Thus we may be able to eliminate unnecessary basis elements. We may also be able to renormalize the basis elements using a Gram-Schmidt orthogonalization procedure to make $B_S \perp B_{S^{\perp}}$. Usually, if we know the properties of the set $S$ that we desire and the nature of the basis set $\{e_k\}$, it will be straightforward to construct a test function, $g$, with which to construct the basis set $B_S$. If $S$ is a subspace, then $S = \mathrm{span}(B_S)$. In any case we can carry out our estimation by

(3.3) $\hat{f} = \sum_k \hat{c}_k e_k$.

In a completely noiseless setting (3.1) is really an equality in norm, i.e. $\| f - \sum_k c_k e_k \| = 0$. If $\mathcal{H}$ is $L_2(\mu)$, with $\mu$ Lebesgue measure, then (3.1) is really

(3.4) $f = \sum_k c_k e_k$ almost everywhere $\mu$, with $c_k = \langle f, e_k \rangle$.

This choice of $c_k$ is a minimum norm choice. However, in a noisy setting, i.e. where we do not know $f$ exactly, we cannot compute $c_k$ directly. We may, however, be able to estimate $c_k$ by standard inference techniques.

Example 3.1. Norm Estimate. The minimum norm estimate of $c_k$ is the choice which minimizes $\| f - \sum_k c_k e_k \|$, i.e. $c_k = \langle f, e_k \rangle$. In the $L_2$ context,

$$\langle f, e_k \rangle = \int_R f(x) e_k(x) \, d\mu(x).$$

If $f$ is a probability density function, then $\langle f, e_k \rangle = E[e_k(X)]$, which can simply be estimated by $n^{-1} \sum_{j=1}^{n} e_k(x_j)$, where $x_j$, $j = 1, \ldots, n$, is the sample of observations. We note that the major approach to estimating the weighting coefficients is via a traditional method of moments.
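The method-of-moments estimate of Example 3.1 can be sketched in a few lines. The following is a minimal illustration, not the author's implementation: it assumes data on $[0, 1]$ and uses the cosine system $e_0 = 1$, $e_k(x) = \sqrt{2}\cos(k\pi x)$ as the orthonormal basis. The basis choice, the Beta(2, 5) sample and the function names are assumptions made only for this example.

```python
import numpy as np

def cosine_basis(k, x):
    """Assumed orthonormal basis on [0, 1]: e_0 = 1, e_k = sqrt(2) cos(k pi x)."""
    x = np.asarray(x, dtype=float)
    if k == 0:
        return np.ones_like(x)
    return np.sqrt(2.0) * np.cos(k * np.pi * x)

def series_density_estimate(sample, n_terms):
    """Estimate c_k = E[e_k(X)] by sample means and return (coeffs, f_hat)."""
    coeffs = np.array([cosine_basis(k, sample).mean() for k in range(n_terms)])

    def f_hat(x):
        # Truncated orthogonal series: sum_k c_hat_k e_k(x)
        return sum(c * cosine_basis(k, x) for k, c in enumerate(coeffs))

    return coeffs, f_hat

rng = np.random.default_rng(1)
sample = rng.beta(2.0, 5.0, size=2000)      # hypothetical data on [0, 1]
coeffs, f_hat = series_density_estimate(sample, n_terms=8)

grid = (np.arange(400) + 0.5) / 400          # midpoint grid on [0, 1]
density_on_grid = f_hat(grid)
```

Because $e_0 = 1$ and the higher cosine terms integrate to zero, the estimate integrates to one regardless of the sample. It is not guaranteed to be nonnegative, which is a known limitation of truncated series estimators.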

Example 3.2. General Form of Estimate. In the general context with optimization functional $L$ we have

(3.5) $L(f) = L\Big( \sum_k c_k e_k \Big) = W(\{c_k\})$.

Since (3.5) is a function of a countable number of variables, $\{c_k\}$, we can find the normal equations and, with the appropriate choice of basis, find a solution. For this we will typically assume $L$ is twice differentiable with respect to all $c_k$. A wide variety of bases have been studied. These include Laguerre polynomials, Hermite polynomials and other orthonormal systems. Perhaps the most well-known orthonormal system is the system of fundamental sinusoids which span $L_2(0, 2\pi)$. One might reasonably guess that wavelets form another orthogonal system. We discuss the connection in the next section.

4.  Fourier  Analysis  and  Wavelets. 

4.1 Bases for $L_2(0, 2\pi)$.

Let us consider the set of square-integrable functions on $(0, 2\pi)$, which we denote by $L_2(0, 2\pi)$. $L_2(0, 2\pi)$ is a Hilbert space, and a traditional choice of an orthonormal basis for this space has been $e_k(x) = e^{ikx}$, the complex sinusoids. Thus any $f$ in $L_2(0, 2\pi)$ has the Fourier representation by Theorem 3.1.c

$$f(x) = \sum_{k=-\infty}^{\infty} c_k e^{ikx}$$

where the constants $c_k$ are the Fourier coefficients defined by

$$c_k = \frac{1}{2\pi} \int_0^{2\pi} f(x) e^{-ikx} \, dx.$$

This pair of equations represents the discrete Fourier transform and the inverse Fourier transform and is the foundation of harmonic analysis. An interesting feature of the complex sinusoids as a base for $L_2(0, 2\pi)$ is that $e_k(x) = e^{ikx}$ can be generated from the superpositions of dilations of a single function, $e(x) = e^{ix}$. By this we mean that

$$e_k(x) = e(kx), \quad k = \ldots, -1, 0, 1, \ldots$$

These are integral dilations in the sense that $k \in J$, the integers. The concept of dilations of a fixed generating function is central to the formation of wavelet bases, as we shall see shortly.

A well known consequence of Theorem 3.1.e for the complex sinusoid basis is the Parseval Theorem. For this base, we have

Theorem 4.1: (Parseval’s Theorem):

(4.1) $\| f \|^2 = \frac{1}{2\pi} \int_0^{2\pi} | f(x) |^2 \, dx = \sum_{k=-\infty}^{\infty} | c_k |^2$

Equation (4.1) is known as Parseval’s Theorem in harmonic analysis and states that the square norm in the frequency domain is equal to the square norm in the time domain.
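The Fourier pair of Section 4.1 and Parseval's Theorem can be verified numerically for a simple trigonometric polynomial. This is a sketch under assumptions: the test function $f(x) = 1 + 2\cos x + \sin 3x$ is chosen only for illustration, and the norm is taken with the same $1/2\pi$ normalization as the coefficients $c_k$. On a uniform grid the mean is an exact quadrature for such low-degree trigonometric polynomials.

```python
import numpy as np

N = 64
x = 2 * np.pi * np.arange(N) / N                 # uniform grid on [0, 2 pi)
f = 1.0 + 2.0 * np.cos(x) + np.sin(3.0 * x)      # illustrative test function

ks = np.arange(-8, 9)
# c_k = (1 / 2 pi) * integral_0^{2 pi} f(x) e^{-ikx} dx, computed as a grid mean
c = np.array([np.mean(f * np.exp(-1j * k * x)) for k in ks])

# Fourier representation f(x) = sum_k c_k e^{ikx}
f_rebuilt = np.real(sum(ck * np.exp(1j * k * x) for k, ck in zip(ks, c)))

energy_time = np.mean(f**2)                      # (1 / 2 pi) * integral |f|^2
energy_freq = np.sum(np.abs(c) ** 2)             # sum_k |c_k|^2
```

For this function the nonzero coefficients are $c_0 = 1$, $c_{\pm 1} = 1$ and $c_{\pm 3} = \mp i/2$, so both sides of the identity equal 3.5.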

While the space $L_2(0, 2\pi)$ is an extremely useful one, for general problems in nonparametric function estimation we are much more interested in $L_2(R)$. We can think of $L_2(0, 2\pi)$ as functions on the finite support $(0, 2\pi)$ or as periodic functions on $R$. In the latter case it is clear that the infinitely periodic functions of $L_2(0, 2\pi)$ and the square integrable functions of $L_2(R)$ are very different. A function $f(x) \in L_2(R)$ must converge to 0 as $x \to \pm\infty$. The generating function $e(x) = e^{ix}$ clearly does not have that behavior and is inappropriate as a basis generating function for $L_2(R)$. What is needed is a generating function, $e(x)$, which also has the property that $e(x) \to 0$ as $x \to \pm\infty$. Thus we want to generate a basis from a function which will decay to 0 relatively rapidly, i.e. we want little waves or wavelets.

4.2 Wavelet Bases.

Let us begin by considering a generating function $\psi$ which we will think of as our mother wavelet or basic wavelet. The idea is that, just as with the sinusoids, we wish to consider a superposition of dilations of the basic waveform $\psi$. For technical convergence reasons which we shall explain later, we wish to consider dyadic dilations rather than simply integral dilations. Thus for the first pass, we are inclined to consider $\psi_j(x) = 2^{j/2} \psi(2^j x)$. Unfortunately, because of the decay of $\psi$ to 0 as $x \to \pm\infty$, the elements $\{\psi_j\}$ are not sufficient to be a basis for $L_2(R)$. We accommodate this by adding translates to get the doubly indexed functions $\psi_{j,k}(x) = 2^{j/2} \psi(2^j x - k)$. We choose $\psi$ such that

$$\int_R \frac{| \hat{\psi}(\omega) |^2}{| \omega |} \, d\omega \quad \text{exists.}$$

Here $\hat{\psi}$ is the Fourier transform of $\psi$. Under certain choices of $\psi$, $\psi_{j,k}$ forms a doubly indexed orthonormal basis for $L_2$ (actually also for Sobolev spaces of higher order as well). As we shall see in the next section, a wavelet basis, due to the dilation-translation nature of its basis elements, admits an interpretation as a simultaneous time-frequency decomposition of $f$. Moreover, using wavelets, fewer basis elements are required for fitting sharp changes or discontinuities. This implies faster convergence in “non-smooth” situations by the introduction of “localized” basis elements.


Example 3.1 Continued: Notice that

$$c_{j,k} = \langle f, \psi_{j,k} \rangle = \int 2^{j/2} \psi(2^j x - k) f(x) \, dx.$$

In the density estimation case

$$c_{j,k} = E\{ 2^{j/2} \psi(2^j X - k) \}.$$

Thus a natural estimator is

$$\hat{c}_{j,k} = \frac{2^{j/2}}{n} \sum_{i=1}^{n} \psi(2^j x_i - k),$$

where $x_i$, $i = 1, \ldots, n$, is the set of observations. Again we are simply using a method of moments estimator.
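The wavelet coefficient estimator above is a sample mean and is straightforward to code. The sketch below assumes the Haar function as the mother wavelet, purely because it has a closed form; the text does not prescribe this choice, and the function names are hypothetical.

```python
import numpy as np

def haar_psi(x):
    """Haar mother wavelet (an assumed choice): +1 on [0, 1/2), -1 on [1/2, 1)."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= 0) & (x < 0.5), 1.0,
                    np.where((x >= 0.5) & (x < 1.0), -1.0, 0.0))

def wavelet_coeff_estimate(sample, j, k, psi=haar_psi):
    """Method-of-moments estimate: (2^{j/2} / n) * sum_i psi(2^j x_i - k)."""
    sample = np.asarray(sample, dtype=float)
    return 2.0 ** (j / 2.0) * np.mean(psi(2.0 ** j * sample - k))

# Two deterministic "samples" used only to illustrate the behaviour.
evenly_spread = (np.arange(1024) + 0.5) / 1024   # mimics a uniform density on [0, 1)
left_half = evenly_spread / 2                    # all mass in [0, 1/2)
```

For data spread evenly over $[0, 1)$ the estimate of $c_{0,0}$ vanishes, while data concentrated on $[0, 1/2)$ give $\hat{c}_{0,0} = 1$, reflecting the contrast between the two halves of the interval that the Haar wavelet measures.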

Notice that we can construct a Parseval’s Theorem for Wavelets.

Theorem 4.2: (Parseval’s Theorem for Wavelets)

(4.2) $\| f \|^2 = \int_{-\infty}^{\infty} | f(x) |^2 \, dx = \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} | \langle f, \psi_{j,k} \rangle |^2 = \sum_{k=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} | c_{j,k} |^2$

At this stage we are left with the problem of constructing an appropriate mother wavelet, $\psi$, suitable for constructing the basis. To do this we turn to the device of multiresolution analysis.


4.3  Multiresolution  Analysis. 

To understand multiresolution analysis let us first consider the construction of the space $W_j = \mathrm{span}\{\psi_{j,k}: k \in J\}$. That is, we fix the dilation and consider the space generated by all possible translates. We may write $L_2(R)$ as a direct sum of the $W_j$,

$$L_2(R) = \bigoplus_{j \in J} W_j,$$

so that any function $f \in L_2(R)$ may be written as

$$f(x) = \cdots + d_{-1}(x) + d_0(x) + d_1(x) + \cdots$$

where $d_j \in W_j$. If $\psi$ is an orthogonal wavelet, then $W_j \perp W_k$, $k \neq j$. We shall assume the unknown $\psi$ to be an orthogonal wavelet in what follows. Notice that as $j$ increases, the basic wavelet form $\psi(2^j x - k)$ contracts, representing higher “frequencies.” For each $j$ we may consider the direct sum $V_j$ given by:

$$V_j = \cdots + W_{j-2} + W_{j-1} = \bigoplus_{m=-\infty}^{j-1} W_m.$$

The $V_j$ are closed subspaces and represent spaces of functions with all “frequencies” at or below a given level of resolution. The set of spaces $\{V_j\}$ has the following properties:

1) They are nested in the sense that $V_j \subset V_{j+1}$, $j \in J$.

2) $\mathrm{Closure}(\bigcup_{j \in J} V_j) = L_2(R)$.

3) $\bigcap_{j \in J} V_j = \{0\}$.

4) $V_{j+1} = V_j + W_j$.

5) $f(x) \in V_j$ if and only if $f(2x) \in V_{j+1}$, $j \in J$.

1), 4) and 5) follow directly from the definition of $V_j$. 2) is a straightforward consequence of the fact that $\bigoplus_{j \in J} W_j = L_2(R)$. 3) follows because of the orthogonality property.

Any $f \in L_2(R)$ can be projected into $V_j$. As we have seen, with $j$ increasing the “frequency” of the wavelet increases, which can be interpreted as higher resolution. Thus the projection, $P_j f$, of $f$ into $V_j$ is an increasingly higher resolution approximation to $f$ as $j \to \infty$. Conversely, as $j \to -\infty$, $P_j f$ is an increasingly blurred (smoothed) approximation to $f$. We shall take $V_0$ as the reference subspace. Suppose now that we can find a function $\phi$ and that we can define $\phi_{j,k}(x) = 2^{j/2} \phi(2^j x - k)$ such that

$$V_0 = \mathrm{span}\{\phi_{0,k}: k \in J\}.$$

Then by property 5), $V_j = \mathrm{span}\{\phi_{j,k}: k \in J\}$. While we began our discussion with the notion of wavelets and have seen some of the consequences, we could have actually begun the discussion with the function $\phi$.

Definition. A function $\phi$ generates a multiresolution analysis if it generates a nested sequence of spaces having properties 1), 2), 3) and 5) such that $\{\phi_{0,k}: k \in J\}$ forms a basis for $V_0$. If so, then $\phi$ is called the scaling function.

For the final discussion of this section, let us consider a multiresolution analysis in which $\{V_j\}$ are generated by a scaling function $\phi \in L_2(R)$ and $\{W_j\}$ are generated by a mother wavelet function $\psi \in L_2(R)$. Any function $f \in L_2(R)$ can be approximated as closely as desired by $f_m \in V_m$ for some sufficiently large $m \in J$. Notice $f_m = f_{m-1} + d_{m-1}$, where $f_{m-1} \in V_{m-1}$ and $d_{m-1} \in W_{m-1}$. This process can be recursively applied, say $l$ times, until we have $f \approx f_m = d_{m-1} + d_{m-2} + \cdots + d_{m-l} + f_{m-l}$. Notice that $f_{m-l}$ is a highly smoothed version of the function. Indeed, this suggests that a statistical procedure might be to form a highly smoothed (even overly smoothed) approximation to a function to be estimated. The sequence $d_{m-1}$ through $d_{m-l}$ forms the higher resolution wavelet approximations. Many of the wavelet coefficients $c_{m-i,k}$ used for constructing $d_{m-i}$, $i = 1, \ldots, l$, are likely to be 0 and hence can contribute to a very parsimonious representation of the function $f$. Indeed, a wavelet decomposition is a natural suggestion for a technology for high definition television (HDTV). If $f_{m-l}$ represents the lower resolution conventional NTSC TV signal, then to reconstruct a high resolution image all that is needed is the difference signal, which could be parsimoniously represented by the wavelet coefficients $c_{m-i,k}$, $i = 1, \ldots, l$ and $k \in J$, most of which would be 0.
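The decomposition $f_m = d_{m-1} + \cdots + d_{m-l} + f_{m-l}$ and the sparsity of the detail coefficients can be illustrated with a discrete Haar pyramid. This is only a sketch under assumptions: the Haar averaging and differencing step stands in for a general wavelet, and the 16-sample step signal is a made-up example.

```python
import numpy as np

def haar_step(a):
    """One level: coarse part s (smooth) and detail part d, both half as long."""
    s = (a[0::2] + a[1::2]) / np.sqrt(2.0)
    d = (a[0::2] - a[1::2]) / np.sqrt(2.0)
    return s, d

def haar_decompose(f, levels):
    """Return the smooth part f_{m-l} and the list of details d_{m-1}, ..., d_{m-l}."""
    smooth = np.asarray(f, dtype=float)
    details = []
    for _ in range(levels):
        smooth, d = haar_step(smooth)
        details.append(d)
    return smooth, details

def haar_reconstruct(smooth, details):
    """Invert haar_decompose by adding the details back, coarsest level first."""
    a = smooth
    for d in reversed(details):
        out = np.empty(2 * len(a))
        out[0::2] = (a + d) / np.sqrt(2.0)
        out[1::2] = (a - d) / np.sqrt(2.0)
        a = out
    return a

f = np.array([1.0] * 5 + [3.0] * 11)      # hypothetical signal with a single jump
smooth, details = haar_decompose(f, levels=3)
n_nonzero = sum(int(np.count_nonzero(np.abs(d) > 1e-12)) for d in details)
n_total = sum(len(d) for d in details)
```

For this signal only one detail coefficient per level is nonzero (the one whose support straddles the jump), so 3 of the 14 detail coefficients carry all the information beyond the two-sample smooth part.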

Most important, however, is the observation that the scaling function $\phi \in V_0$ and the mother wavelet $\psi \in W_0$ implies that both are in $V_1$. Since $V_1$ is generated by $\phi_{1,k}(x) = 2^{1/2} \phi(2x - k)$, there are sequences $\{g(k)\}$ and $\{h(k)\}$ such that

(4.3) $\phi(x) = \sum_{k \in J} g(k) \phi(2x - k)$ and $\psi(x) = \sum_{k \in J} h(k) \phi(2x - k)$.

This remarkable result gives us a construction for the mother wavelet in terms of the scaling function. These equations are called the two-scale difference equations. We can give a time series interpretation to these equations. Let us consider an original discrete time function, $f(n)$, to which we apply the filter

$$y(n) = \sum_{k \in J} g(k) f(2n - k).$$

First of all we note that there is a scale change due to subsampling by two, i.e. a shift by two in $f(n)$ results in a shift of one in $y(n)$. The scale of $y$ is only half that of $f$. Otherwise this is a low-pass filter with impulse response function $g$. Let us consider iterating this equation so that

(4.4) $y^{(j)}(n) = \sum_{k \in J} g(k) y^{(j-1)}(2n - k)$.

Notice that if this procedure converges, it converges to a fixed point which will be $\phi$. This iterative procedure with repeated downsampling by two is suggestive of a method for constructing wavelets. If $g$ is a finite impulse response (FIR) filter of length $l$, the construction of a complementary high-pass filter is accomplished with a FIR filter, $h$, whose impulse response is given by $h(l - 1 - n) = (-1)^n g(n)$. This scheme is called sub-band coding in the electrical engineering literature. The low-pass band is given by

(4.5) $y_0(n) = \sum_{k \in J} g(k) f(2n - k)$

while the high-pass band is given by

(4.6) $y_1(n) = \sum_{k \in J} h(k) f(2n - k)$.

The filter impulses as defined form an orthonormal set so that $f$ may be reconstructed by

(4.7) $f(n) = \sum_{k \in J} [ y_0(k) g(2k - n) + y_1(k) h(2k - n) ]$.

The sub-band coding scheme may be repeatedly applied to form the nested sequence of $\{V_j\}$. The nested sequence of $\{V_j\}$ is then essentially obtained by recursively downsampling and filtering a function with a low-pass filter whose impulse response function is $g(\cdot)$.
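Equations (4.5) through (4.7) can be exercised directly on a finite signal. The sketch below rests on assumptions that are not in the text: the signal is treated as periodic so that indices wrap around, and the four-tap low-pass filter is the Daubechies example rescaled so that $\sum g(n)^2 = 1$ (that is, divided by $4\sqrt{2}$ rather than 8), which is the normalization this reconstruction check needs. The high-pass filter follows $h(l - 1 - n) = (-1)^n g(n)$.

```python
import numpy as np

def analysis(f, filt):
    """y(n) = sum_k filt(k) f(2n - k), with periodic indexing (an assumption)."""
    N = len(f)
    y = np.zeros(N // 2)
    for n in range(N // 2):
        for k in range(len(filt)):
            y[n] += filt[k] * f[(2 * n - k) % N]
    return y

def synthesis(y0, y1, g, h):
    """f(n) = sum_k [y0(k) g(2k - n) + y1(k) h(2k - n)], as in (4.7)."""
    N = 2 * len(y0)
    f = np.zeros(N)
    for k in range(len(y0)):
        for t in range(len(g)):
            # Filter tap t = 2k - n corresponds to output index n = 2k - t.
            f[(2 * k - t) % N] += y0[k] * g[t] + y1[k] * h[t]
    return f

s3 = np.sqrt(3.0)
# Assumed orthonormal scaling of the four-tap filter (sum of squares equals 1).
g = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
L = len(g)
# h(L - 1 - n) = (-1)^n g(n): alternate the signs, then reverse the order.
h = np.array([(-1) ** n * g[n] for n in range(L)])[::-1]

rng = np.random.default_rng(3)
f = rng.standard_normal(16)
y0 = analysis(f, g)            # low-pass band (4.5)
y1 = analysis(f, h)            # high-pass band (4.6)
f_rebuilt = synthesis(y0, y1, g, h)
```

Under these assumptions the two half-length bands together preserve the energy of `f`, and `f_rebuilt` agrees with `f` to rounding error. Applying `analysis` again to `y0` gives the next coarser level.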

4.4  Construction  of  Scaling  Functions  and  Mother  Wavelets. 

We have already hinted that the scaling function may be constructed as the fixed point of the down-sampled, low-pass filter equation (4.4). This can be formalized by considering what statisticians would call the generating function of $g(n)$ and what electrical engineers call the z-transform of $g(\cdot)$:

(4.8) $G(z) = \frac{1}{2} \sum_{j \in J} g(j) z^j$.

Notice that if $z = e^{-i\omega/2}$, then (4.8) is essentially the Fourier transform of the impulse response function $g(\cdot)$. In this case, the first equation in (4.3) may be written as

(4.9) $\hat{\phi}(\omega) = G(z) \hat{\phi}(\omega/2)$, with $z = e^{-i\omega/2}$.

This, of course, follows because the Fourier transform of a convolution is the corresponding product of the Fourier transforms. This recursive equation may be iterated to obtain

(4.10) $\hat{\phi}(\omega) = \prod_{k=1}^{\infty} G(e^{-i\omega/2^k}) \, \hat{\phi}(0)$.

We may take $\hat{\phi}$ to be continuous and $\hat{\phi}(0) = 1$. Based on (4.10) we may recover $\phi(\cdot)$, and based on this result, the equation $h(l - 1 - n) = (-1)^n g(n)$ and the second equation of (4.3), we may recover the mother wavelet, $\psi(\cdot)$. Thus Daubechies’ original construction, which was motivated by multiresolution analysis, shows that wavelets with compact support can be based on finite impulse response filters. Theorem 4.3 below summarizes the general form of Daubechies’ result.
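The infinite product (4.10) can be evaluated numerically by truncation. The sketch below uses the Haar case, where the answer is known in closed form, as a check. It assumes two-scale coefficients $c(0) = c(1) = 1$, normalized so that $G(1) = 1$ with $G(z) = \frac{1}{2}\sum_j c(j) z^j$; this normalization is chosen here so that the product converges and may differ from other conventions. The function names and the truncation at 40 factors are illustrative choices.

```python
import numpy as np

def G(z, c):
    """G(z) = (1/2) * sum_j c(j) z^j for a finite coefficient sequence c."""
    j = np.arange(len(c))
    return 0.5 * np.sum(c * z[..., None] ** j, axis=-1)

def phi_hat(omega, c, n_factors=40):
    """Truncated product prod_{k=1}^{n_factors} G(exp(-i omega / 2^k)), phi_hat(0) = 1."""
    omega = np.asarray(omega, dtype=float)
    c = np.asarray(c, dtype=float)
    prod = np.ones_like(omega, dtype=complex)
    for k in range(1, n_factors + 1):
        prod *= G(np.exp(-1j * omega / 2.0**k), c)
    return prod

haar_c = [1.0, 1.0]                       # assumed Haar two-scale coefficients
omega = np.linspace(0.1, 20.0, 50)
numeric = phi_hat(omega, haar_c)

# Fourier transform of the indicator of [0, 1): exp(-i w / 2) * sin(w / 2) / (w / 2).
# np.sinc(x) is sin(pi x) / (pi x), hence the rescaled argument.
closed_form = np.exp(-1j * omega / 2) * np.sinc(omega / (2 * np.pi))
```

In the Haar case the product reproduces the transform of the unit box, which is the Haar scaling function. For a longer filter the same routine gives $\hat{\phi}$ on a frequency grid, from which $\phi$ could be recovered by a numerical inverse transform.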
Theorem 4.3: (Daubechies’ Wavelet Construction):

Let $g(n)$ be a sequence such that

a. $\sum_{n \in J} | g(n) | \, | n |^{\epsilon} < \infty$ for some $\epsilon > 0$,

b. $\sum_{n \in J} g(n - 2j) \, g(n - 2k) = \delta_{jk}$,

c. $\sum_{n \in J} g(n) = 1$.

Suppose that $\hat{g}(\omega) = G(e^{-i\omega/2}) = 2^{-1} \sum_{n \in J} g(n) e^{-in\omega/2}$ can be written as

$$\hat{g}(\omega) = \Big[ \tfrac{1}{2} (1 + e^{-i\omega/2}) \Big]^N \cdot \Big[ \sum_{n \in J} f(n) e^{-in\omega/2} \Big]$$

where

d. $\sum_{n \in J} | f(n) | \, | n |^{\epsilon} < \infty$ for some $\epsilon > 0$,

e. $\sup_{\omega \in R} \Big| \sum_{n \in J} f(n) e^{-in\omega/2} \Big| < 2^{N-1}$.

Define

$$h(n) = (-1)^n g(-n + 1),$$

$$\psi(x) = \sum_{k \in J} h(k) \phi(2x - k).$$

Then the orthonormal wavelet basis is $\psi_{j,k}$ determined by the mother wavelet $\psi$. Moreover, if $g(n) = 0$ for all sufficiently large $|n|$, then the wavelets so determined have compact support. □

We state this result without proof, which may be found in Daubechies (1988). We note that Daubechies also shows that the mother wavelet, $\psi$, cannot be an even function and also have a compact support. The exception to this is the trivial constant function which gives rise to the so-called Haar basis. Daubechies illustrates this computation with the example of $g$ given by $g(0) = (1 + \sqrt{3})/8$, $g(1) = (3 + \sqrt{3})/8$, $g(2) = (3 - \sqrt{3})/8$ and, finally, $g(3) = (1 - \sqrt{3})/8$. This wavelet is illustrated in Figure 4.1.

5. Transient Signal Function Estimation.

Now  with  the  basic  construction  of  wavelets  in  hand,  we  can  turn  to  the 
transient  signal  processing  application.  Wavelets  have  as  one  of  their  prime 
applications  transient  signal  processing.  In  particular,  since  the  most  effective  wavelets 
are  those  with  compact  support,  they  are  a  natural  basis  for  transient  signal  estimation. 
However,  if  we  are  to  exploit  them  in  the  context  of  optimal  nonparametric  function 
estimation,  we  must  construct  an  optimality  criterion  for  transient  signals.  The 
discussion  below  outlines  an  approach  to  transient  signal  estimation  set  in  the  context  of 
optimal  nonparametric  function  estimation.  A  fuller  treatment  can  be  found  in  Le  and 
Wegman  (1992).  We  first  consider  signals.  It  is  well-known  that  there  is  no  non-zero 
function  in  L2(R)  which  is  both  band-limited  and  time-limited.  This  being  the  case,  we 
will  assume  the  signal  to  be  hard  band-limited,  i.e.  with  no  energy  outside  a  fixed 
interval, say $[-\nu, \nu]$, but soft time-limited, i.e. with minimal energy in the tails. This
particular  example  demonstrates  an  elegant  application  of  moments  to  signal  processing. 

5.1  Measuring  of  Out-of-Band  Energy 

Let $L_2(R)$ be the set of square-integrable, real-valued functions and let $f(t) \in L_2(R)$. Denote by $\hat{f}(\omega)$ the Fourier transform of $f(t)$, so that $\hat{f} \in L_2(R)$. We assume $\hat{f}$ is frequency band-limited so that $\hat{f}(\omega) = 0$ for $|\omega| > \nu$. We propose approximating the class of band-limited time-transient functions by considering functions whose energy time spread is confined to some small level $s_0$. As a measure of the energy time-spread, we will use analogies to concepts from probability theory to define various moments of $|f(t)|^2$, which plays the role of the energy distribution function. Assuming that

$$\int_{-\infty}^{\infty} | t |^j \, | f(t) |^2 \, dt < \infty, \quad j = 1, 2, \ldots, k,$$

the $k$th moment of the energy distribution will now be defined as follows

$$M_k = \int_{-\infty}^{\infty} t^k \, | f(t) |^2 \, dt.$$

For $k = 2$, we have the 2nd moment of the energy distribution function as a measure of the energy time spread, given as

$$M_2 = \int_{-\infty}^{\infty} t^2 \, | f(t) |^2 \, dt.$$

Remark: The factor $t^k$ serves as a weight on the energy function which is used to control the degree of spreading in $|f(t)|^2$. A larger $k$ value implies that more weight is applied at the tail-end of the energy distribution function and, therefore, the process of minimizing requires that more energy be centrally concentrated.
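The energy moments $M_k$ are easy to approximate by quadrature. The sketch below assumes a Gaussian pulse $f(t) = e^{-t^2/2}$ as a stand-in signal, chosen only because its moments are known exactly ($M_0 = \sqrt{\pi}$, $M_2 = \sqrt{\pi}/2$, $M_4 = 3\sqrt{\pi}/4$); the function name and grid are illustrative.

```python
import numpy as np

def energy_moment(f_vals, t, k):
    """Riemann-sum approximation of M_k = integral t^k |f(t)|^2 dt on a uniform grid."""
    dt = t[1] - t[0]
    return np.sum(t**k * np.abs(f_vals) ** 2) * dt

t = np.linspace(-10.0, 10.0, 20001)
pulse = np.exp(-t**2 / 2.0)          # assumed test signal, not from the text

M0 = energy_moment(pulse, t, 0)      # total energy
M2 = energy_moment(pulse, t, 2)      # second moment: energy time spread
M4 = energy_moment(pulse, t, 4)      # heavier weight on the tails
```

Comparing the ratios $M_2 / M_0$ and $M_4 / M_0$ across candidate signals gives a quick indication of how strongly each moment penalizes energy in the tails.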

5.2  Optimal  Estimation  of  Band- Limited  Processes 

For $-\nu$ and $\nu$ real numbers, and $m$ and $p$ integers, where $-\infty < -\nu < \nu < \infty$, and $m > 0$ and $p > 1$, the Sobolev space $\mathcal{W}^{m,p}[-\nu, \nu]$ of complex-valued functions $\hat{f}$ on $[-\nu, \nu]$ is given by:

$$\mathcal{W}^{m,p}[-\nu, \nu] = \Big\{ \hat{f}(\omega): \hat{f}^{(k)}(\omega), \; k = 0, 1, \ldots, m-1, \text{ are absolutely continuous and } \int_{-\nu}^{\nu} | \hat{f}^{(m)}(\omega) |^p \, d\omega < \infty \Big\}.$$

We consider observing an actual process, $r(t)$, and we let $\hat{r}(\omega)$ be the Fourier transform of the observed process, $r(t)$. The Fourier transform of the observed process will then be modeled as $\hat{r}(\omega) = \hat{g}(\omega) + \xi(\omega)$, where $\xi(\omega)$ is the spectrum of a stationary noise process and $\hat{g}(\omega) \in \mathcal{W}^{m,2}[-\nu, \nu]$. The fact that $\hat{f}$ belongs to the class $\mathcal{W}^{m,2}[-\nu, \nu]$ of band-limited signals implies that the support of $|f(t)|^2$ is not bounded. The objective is, then, to find a function $\hat{f}(\omega) \in \mathcal{W}^{m,2}[-\nu, \nu]$ which best fits the Fourier transform $\hat{r}(\omega)$ of the observed process $r(t)$ with minimum time-energy spread; specifically we would like to minimize the following functional with $k < m$

(5.1) $\min_{\hat{f} \in \mathcal{W}^{m,2}[-\nu, \nu]} \Big[ \sum_{j=1}^{n} ( \hat{f}(\omega_j) - \hat{r}(\omega_j) )^2 \Big]$ subject to $\int_{-\infty}^{\infty} t^{2k} \, | f(t) |^2 \, dt \leq s_0$,

where $f(t)$ is the inverse Fourier transform corresponding to $\hat{f}(\omega)$ in $\mathcal{W}^{m,2}[-\nu, \nu]$.

5.3  Moment  Connection  via  Parseval’s  Theorem 

A rather elegant extension of Parseval’s Theorem can be constructed under appropriate regularity conditions. The Parseval’s Theorem for continuous Fourier transform pairs is

$$\int_{-\nu}^{\nu} | \hat{f}(\omega) |^2 \, d\omega = \frac{1}{2\pi} \int_{-\infty}^{\infty} | f(t) |^2 \, dt.$$

But we know

$$\hat{f}(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} f(t) e^{-i\omega t} \, dt.$$

Take $k$ derivatives with respect to $\omega$, so that

$$\hat{f}^{(k)}(\omega) = \frac{d^k \hat{f}(\omega)}{d\omega^k}$$

is the Fourier transform of $(-it)^k f(t)$.

We can apply Parseval’s Theorem to this Fourier transform pair to obtain

Theorem 5.1:

$$\int_{-\nu}^{\nu} | \hat{f}^{(k)}(\omega) |^2 \, d\omega = \frac{1}{2\pi} \int_{-\infty}^{\infty} t^{2k} \, | f(t) |^2 \, dt. \quad □$$
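The moment form of Parseval's Theorem can be checked numerically. The sketch below makes several assumptions that are not in the text: it adopts the convention $\hat{f}(\omega) = \frac{1}{2\pi}\int f(t) e^{-i\omega t} dt$, uses a Gaussian pulse as the signal (which is only approximately band-limited, so $\nu = 8$ is taken large enough that the truncated spectral mass is negligible), and takes $k = 1$.

```python
import numpy as np

t = np.linspace(-10.0, 10.0, 2001)
dt = t[1] - t[0]
omega = np.linspace(-8.0, 8.0, 801)        # assumed band [-nu, nu] with nu = 8
d_omega = omega[1] - omega[0]

f = np.exp(-t**2 / 2.0)                    # assumed test signal
k = 1

# k-th derivative of f_hat: the transform of (-i t)^k f(t), by direct quadrature.
kernel = np.exp(-1j * np.outer(omega, t))  # e^{-i omega t}
f_hat_k = (kernel @ ((-1j * t) ** k * f)) * dt / (2 * np.pi)

lhs = np.sum(np.abs(f_hat_k) ** 2) * d_omega                   # integral |f_hat^(k)|^2
rhs = np.sum(t ** (2 * k) * np.abs(f) ** 2) * dt / (2 * np.pi)   # (1/2pi) integral t^{2k} |f|^2
```

For this pulse both sides equal $\sqrt{\pi}/(4\pi)$, so the roughness of the spectrum and the time spread of the signal are the same quantity up to the constant fixed by the transform convention.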

Thus, our optimization problem (5.1) can now be reformulated as

(5.2) $\min_{\hat{f} \in \mathcal{W}^{m,2}[-\nu, \nu]} \Big[ \sum_{j=1}^{n} ( \hat{f}(\omega_j) - \hat{r}(\omega_j) )^2 \Big]$ subject to $\int_{-\nu}^{\nu} | \hat{f}^{(k)}(\omega) |^2 \, d\omega \leq s_1$.

Using standard Lagrange multiplier techniques, this in turn may be reformulated as

(5.3) $\min_{\hat{f} \in \mathcal{W}^{m,2}[-\nu, \nu]} \Big[ \sum_{j=1}^{n} ( \hat{f}(\omega_j) - \hat{r}(\omega_j) )^2 + \lambda \int_{-\nu}^{\nu} | \hat{f}^{(k)}(\omega) |^2 \, d\omega \Big]$.

Indeed, expression (5.3) is the form of optimization problem whose solution is a generalized polynomial spline of degree $2k - 1$. This result may be substantially generalized by the theorem given below, which is developed in Le and Wegman (1992).
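A discrete analogue of (5.3) with $k = 2$ can be solved with a few lines of linear algebra. The sketch below is not a spline solver and is not the method of Le and Wegman (1992): it assumes the spectrum is observed at every point of a uniform frequency grid, treats the values as real, and replaces the integral of the squared second derivative by a second-difference penalty. The grid, noise level and value of the smoothing parameter are made up for illustration.

```python
import numpy as np

def second_difference(m, h):
    """(m-2) x m matrix approximating the second derivative on a grid of spacing h."""
    return np.diff(np.eye(m), n=2, axis=0) / h**2

def roughness(f, h):
    """Discrete version of integral |f''(omega)|^2 d omega."""
    D = second_difference(len(f), h)
    return h * np.sum((D @ f) ** 2)

def penalized_fit(r, h, lam):
    """Minimize sum_j (f_j - r_j)^2 + lam * roughness(f); normal equations."""
    m = len(r)
    D = second_difference(m, h)
    return np.linalg.solve(np.eye(m) + lam * h * (D.T @ D), r)

nu = 1.0
omega = np.linspace(-nu, nu, 101)
h = omega[1] - omega[0]
rng = np.random.default_rng(2)
r_obs = np.exp(-4 * omega**2) + 0.05 * rng.standard_normal(omega.size)  # hypothetical data
f_lam = penalized_fit(r_obs, h, lam=1e-4)
```

Larger values of `lam` trade fidelity to the observed spectrum for a smoother fit, which by Theorem 5.1 corresponds to a smaller time spread of the reconstructed signal. Exactly linear data incur no penalty and are returned unchanged.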

Theorem 5.2: Let $\hat{g}(\omega)$ be a band-limited spectral process with transient inverse Fourier transform and $\hat{r}(\omega)$ be the observed spectral process defined over some finite band $-\nu \leq \omega \leq \nu$. We model this spectral process as

$$\hat{r}(\omega) = \hat{g}(\omega) + \xi(\omega)$$

where $\xi(\omega)$ is some stationary white noise process. Let $\Lambda$ be the time spread measure, defined as follows:

$$\Lambda(f) = a_0 \Lambda_0(f) + a_1 \Lambda_k(f)$$

where

$$\Lambda_k(f) = \int_{-\infty}^{+\infty} t^{2k} \, | f(t) |^2 \, dt,$$

and where $a_0$ and $a_1$ are the appropriately chosen weights. Here $f$ is the inverse Fourier transform of $\hat{f}$ belonging to $L_2(R)$. Then, the optimal band-limited representation in the Sobolev space $\mathcal{W}^{m,2}[-\nu, \nu]$ is $\hat{f}_{\lambda}(\omega)$, where $\hat{f}_{\lambda}(\omega)$ is the solution to the problem:

$$\underset{\hat{f} \in \mathcal{W}^{m,2}[-\nu, \nu]}{\text{minimize}} \; \sum_{j=1}^{n} [ \hat{f}(\omega_j) - \hat{r}(\omega_j) ]^2 + \lambda \Lambda(f).$$

$\hat{f}_{\lambda}$ is a generalized L-spline, and $\lambda$ is known as the smoothing parameter. □

For a general discussion of L-splines, see Wegman and Wright (1983). Notice that if $\Lambda(f) = \Lambda_k(f)$ for some large $k$, then we are constructing a band-limited transient signal estimator with little energy in the tail of the signal estimate, $f_{\lambda}$, where $f_{\lambda}$ is the inverse Fourier transform of $\hat{f}_{\lambda}$. If $k = 2$, then

$$\Lambda_2(f) = \int_{-\infty}^{+\infty} t^4 \, | f(t) |^2 \, dt = 2\pi \int_{-\nu}^{\nu} | \hat{f}''(\omega) |^2 \, d\omega,$$

and our solution is the well-known cubic spline. However, much more interesting and physically meaningful solutions may be found. If $\Lambda(f) = a_0 \Lambda_0(f) + a_1 \Lambda_k(f)$, then for $k$ odd

$$\Lambda(f) = a_0 \int_{-\infty}^{\infty} | f(t) |^2 \, dt + a_1 \int_{-\infty}^{\infty} t^{2k} \, | f(t) |^2 \, dt.$$

Thus, we may also want to impose a total energy restriction on the estimated signal space. This imposed restriction may, for example, have resulted from a requirement to minimize channel bandwidth utilization from data transmission systems. Such a modification yields the following optimization problem for $k$ odd:

$$\min_{\hat{f} \in \mathcal{W}^{m,2}[-\nu, \nu]} \Big[ \sum_{j=1}^{n} ( \hat{f}(\omega_j) - \hat{r}(\omega_j) )^2 + \lambda_1 \int_{-\nu}^{\nu} | \hat{f}(\omega) |^2 \, d\omega + \lambda_2 \int_{-\nu}^{\nu} | \hat{f}^{(k)}(\omega) |^2 \, d\omega \Big].$$

Hence, by our theorem the optimal solution is again an L-spline.

5.4  Computing  Band-limited  Transient  Estimators  and  Example 

The rather elegant result that our band-limited transient estimators are generalized L-splines makes the numerical computation of the estimators rather more routine, since algorithms already exist for computing L-splines. The fact that we can impose total energy limits as well as tail-energy limits is an unexpected bonus. Our interpretation of Theorem 5.2 is as follows. We recommend doing an initial spectral estimation to establish the bandwidth, $-\nu \leq \omega \leq \nu$, over which we want to estimate $\hat{g}(\omega)$ (or more precisely the signal, $g(t)$, its inverse Fourier transform). This initial spectral estimate will also allow us to select the sampling frequencies, $\omega_j$. We recommend selecting these $\omega_j$ as the frequencies with the largest spectral mass. Notice that we may regard a transient signal, $g(t)$, as the product of a signal of infinite support with an indicator function of a closed interval. It is well-known that the Fourier transform of an indicator function is the so-called Dirichlet kernel, which has a large central lobe and decreasing side lobes. By choosing sampling frequencies $\omega_j$ at the location of the central and side lobes, our technique allows us to recover the indicator to an excellent approximation. Thus not only do we estimate the transient signal because of the penalty term for out-of-band energy, but because of the choice of sampling frequencies as well. Figure 5.1 graphically illustrates the results of our technique.
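The recommendation to pick the sampling frequencies $\omega_j$ where the initial spectral estimate has the most mass can be sketched with a raw periodogram. Everything specific here is an assumption for illustration: the periodogram stands in for whatever initial spectral estimator is used, and the test signal is a made-up two-cycle burst of a 16 Hz sinusoid sampled at 256 Hz.

```python
import numpy as np

def top_frequencies(signal, sample_rate, n_freqs):
    """Return the n_freqs frequencies (Hz) with the largest periodogram values."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    order = np.argsort(spectrum)[::-1][:n_freqs]   # indices sorted by decreasing mass
    return freqs[order], spectrum[order]

sample_rate = 256.0
t = np.arange(256) / sample_rate
# Hypothetical transient: two cycles of a 16 Hz sinusoid, zero elsewhere.
burst = np.where((t >= 0.25) & (t < 0.375),
                 np.sin(2 * np.pi * 16 * (t - 0.25)), 0.0)

freqs_top, mass_top = top_frequencies(burst, sample_rate, n_freqs=10)
```

Because the burst is short, its spectral mass is spread over a central lobe near 16 Hz and neighbouring side lobes, so the selected frequencies cluster around the carrier rather than sitting on a single line.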
Figure 5.1c. Recovery of two-cycle signal waveform by optimal band-limited techniques.